<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/Contextual_Keyword_Analysis_and_Optimization_Using_Word2Vec_and_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Name: Contextual Keyword Analysis and Optimization Using Word2Vec and Web Scraping
  

**Summary:** This project analyzes and optimizes a website's keywords using web scraping and Word2Vec. Web scraping extracts the text content from a website, and Word2Vec is then trained on that text to show how keywords relate to each other and which terms appear in similar contexts. These relationships can guide content optimization, with the goal of making the website's content more relevant and effective for search engines like Google.

**How does Word2Vec work in this project?**


*   **Word2Vec** is a tool that helps the computer understand the meanings of words based on the context they appear in. In this project, Word2Vec is trained on the text content of a website to create a map of word relationships. For example, it might learn that the words "SEO," "marketing," and "services" are closely related because they frequently appear together in the website's content. This allows us to identify important keywords and understand how they connect with each other, which is crucial for optimizing the website's content.






# What is Word2Vec?
*   **Word2Vec** is a technique used in **Natural Language Processing (NLP)** to understand the meanings of words and their relationships to each other. Think of it as a way to teach a computer how words are connected by turning each word into a list of numbers called a **"vector."** These vectors help the computer understand that some words are more closely related than others.

# Why is Word2Vec Important?

*   Imagine you have a large amount of text, like a book or the entire internet. You want the computer to understand that the words "king" and "queen" are related or that "Paris" is to "France" as "Rome" is to "Italy." Word2Vec helps the computer figure out these relationships without needing to be explicitly told.

# How Does Word2Vec Work?

*   **Word2Vec works by reading lots of text and learning from it.** It creates a map where each word is represented by a point in space. Words that are similar or often appear together are placed closer to each other on this map.

**Here’s a step-by-step breakdown:**


1.   **Reading the Text:** The computer reads through text and looks at each word in the context of the words around it. For example, in the sentence "The king sat on the throne," it sees that "king" appears near "throne."


2.   **Learning Relationships:** Over time, the computer learns that certain words frequently appear together. It starts to understand that "king" and "queen" are related because they often appear in similar contexts (like "sat on the throne" or "ruled the kingdom").


3.   **Creating Vectors:** The computer then represents each word as a vector—a list of numbers that encode the word's meaning based on its context. The more similar the context, the more similar the vectors.
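To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare word vectors. The three-number vectors and the helper names `cosine_similarity` and `rank_neighbors` are invented for this illustration; real Word2Vec vectors are learned from text and usually have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen by hand (not learned by any model)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "mat": [0.1, 0.0, 0.9],
}

def rank_neighbors(word, vectors):
    """Rank every other word by its cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for other, score in rank_neighbors("king", toy_vectors):
    print(f"{other}: {score:.3f}")
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means they have little in common.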

**Example of How Word2Vec Works**

Let’s say you have a collection of sentences:

*   "The king rules the kingdom."
*   "The queen rules the kingdom."
*   "The king and queen are royalty."
*   "The prince is part of the royal family."

**Word2Vec** reads these sentences and notices that **"king"** and **"queen"** often appear in similar situations **(ruling the kingdom, being royalty).** It then places **"king"** and **"queen"** close to each other in its map because they share similar contexts.

*Now*, if you ask the computer to find words similar to **"king,"** it might suggest **"queen,"** **"prince,"** or **"royalty"** because these words appear in similar contexts.
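The idea of "shared contexts" can be checked by hand on the four sentences above. The sketch below is not Word2Vec itself; it only counts which words appear within two positions of each word, under the assumption that simple whitespace splitting is good enough for this tiny example. The function name `context_counts` is made up for illustration.

```python
from collections import Counter, defaultdict

sentences = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the king and queen are royalty",
    "the prince is part of the royal family",
]

def context_counts(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    contexts[word][words[j]] += 1
    return contexts

contexts = context_counts(sentences)

# Context words that "king" and "queen" have in common
shared = set(contexts["king"]) & set(contexts["queen"])
print(sorted(shared))
```

Word2Vec goes further than counting: it learns vectors so that words with overlapping contexts like these end up close together.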

# Why is This Useful?

**Word2Vec** is powerful because it allows the computer to make predictions and understand relationships between words without needing to be explicitly told. For instance, if you give the computer the word **"Paris"** and ask it to find similar words, it might return **"France,"** **"Rome,"** or **"London"** because it has learned that **Paris** is often associated with countries and capitals.

# Practical Example in Everyday Use

*   Let’s say you’re using an online store's search engine. You type in **"smartphone,"** and the engine also shows results for **"mobile phone"** or **"cell phone."** This is because **Word2Vec** (or a similar technique) has learned that these words are used in similar contexts and are therefore related.

# Word2Vec vs. Traditional Methods

*   **Before Word2Vec,** computers might have treated words simply as individual, unrelated items. If you searched for **"king,"** it wouldn’t know to also show **"queen"** or **"prince."** **Word2Vec** changed this by understanding that words have **relationships** and **meanings** based on how they are used together.



































**requests:** This tool allows us to ask a website to give us its content. Think of it like a browser that visits a website and brings back the web page.

**BeautifulSoup:** Once we have the content of the website, this tool helps us pick out the important parts (like text) and ignore the rest (like ads or menus).

**re:** This is Python's built-in regular expression module. We use it to clean up the text by collapsing runs of spaces, tabs, and newlines into single spaces, making the text easier to process.

**NLTK (Natural Language Toolkit) library:**

**nltk:** This is a popular Python library that helps us work with human language data, like text.

**word_tokenize:** This tool breaks down a sentence into individual words.

**sent_tokenize:** This tool breaks down a large piece of text into individual sentences.

**Explanation:**

*  **Why it’s needed:** To analyze text, it’s often helpful to break it down into smaller pieces (sentences and words) that the computer can work with more easily.

**nltk.download('punkt')**

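*   **Explanation:** Before we can use sent_tokenize and word_tokenize, we need to make sure the necessary data is available. The punkt tokenizer models are pre-trained data that help NLTK split text into sentences. word_tokenize depends on them too, because it first splits its input into sentences before splitting those into words. Recent NLTK releases may ask for a resource named punkt_tab instead; if a LookupError mentions it, download that resource as well.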


*   **Why it’s needed:** Think of it like downloading a map before you go on a trip. This map helps the tools understand how to navigate through the text and break it down correctly.















In [None]:
import requests  # Library to send HTTP requests to a website and get its content
from bs4 import BeautifulSoup  # Library to parse and extract information from HTML content
from gensim.models import Word2Vec  # Library to create and train the Word2Vec model
import re  # Library to perform regular expression operations for text cleaning
import nltk  # Natural Language Toolkit library for text processing
from nltk.tokenize import word_tokenize, sent_tokenize  # Functions to split text into sentences and words

# Download necessary data for sentence and word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 1. Sending a Request to the Website

**response = requests.get(url)**

**Explanation:** This line asks the website to give us its content. When you visit a website in your browser, your browser is doing the same thing—it’s asking the website to send its page so you can see it. Here, the requests.get(url) line is doing that for us. We save the website’s response in a variable called response.


In [None]:
# Send a GET request to the website to retrieve its content
url='https://thatware.co/'
response = requests.get(url)


# 2. Checking if the Request was Successful
**if response.status_code != 200:**

    print(f"Failed to retrieve content from {url}")
    return ""

*   **Explanation:** After asking for the website's content, we need to check if the request was successful. Websites can sometimes refuse to give us their content or may not be available. The status_code tells us if everything went okay:

*   **200:** This code means "Success!" and the website has given us its content.

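*   **Not 200:** If we don’t get a 200, something went wrong, and the website didn’t give us what we asked for. In that case, we print a message saying we failed. Inside a function such as scrape_website (defined later in this notebook), we also return an empty result (""). The standalone cell below only prints the message, because return is not allowed outside a function.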
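For a feel of what other status codes mean, the sketch below uses Python's standard `http.HTTPStatus` enum to describe a few of them. The helper name `describe_status` is made up for this illustration, and the code does not contact any website. It treats the whole 2xx range as success, whereas the notebook's check is stricter and treats anything other than 200 as a failure.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # Raises ValueError for codes the enum does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 404, 503):
    print(describe_status(code))
```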






In [None]:
# Check if the request was successful (status code 200 means success)

if response.status_code != 200:
    print(f"Failed to retrieve content from {url}")



# 3. Parsing the HTML Content

**soup = BeautifulSoup(response.text, 'html.parser')**

*   **Explanation:** If the request was successful, we now have the content of the website, but it’s in a format called HTML (the code used to create web pages). This HTML has a lot of extra stuff like images, buttons, and links. We use BeautifulSoup to parse this HTML into a structured object that we can search and modify, which makes it much easier to get at the text later. We call this parsed document soup.






In [None]:
# Parse the HTML content of the website using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>

<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-US"> <![endif]-->
<!--[if gt IE 9]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head><meta charset="utf-8"/><script>if(navigator.userAgent.match(/MSIE|Internet Explorer/i)||navigator.userAgent.match(/Trident\/7\..*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]nowprocket/)){if(href.indexOf("?")==-1){if(href.indexOf("#")==-1){document.location.href=href+"?nowprocket=1"}else{document.location.href=href.replace("#","?nowprocket=1#")}}else{if(href.indexOf("#")==-1){document.location.href=href+"&nowprocket=1"}else{document.location.href=href.replace("#","&nowprocket=1#")}}}}</script><script>(()=>{class RocketLazyLoadScripts{constructor(){this.v="1.2.6",this.triggerEvents=["keydown","mousedown","mousemove","touchmove","touchstart","touchend","wheel"],this.userEventHandler=this.t.bind(this),this.touchStartHandler=this.i.bind(this),this.touchMoveHandler=this.o.bind(this),this.touchEndHandler=th

# 4. Removing Unnecessary Parts

**for script_or_style in soup(['script', 'style']):**

    script_or_style.decompose()


*   **Explanation:** Websites often have things like scripts (which make the site interactive) and styles (which make it look nice). These aren’t useful for us when we just want the text, so we remove them. The decompose() method completely removes these parts from the content. Imagine you’re looking at a book, and you just want to read the words without the pictures or decorations—this step helps us do that.





In [None]:
# Remove <script> and <style> tags to clean up the text content
for script_or_style in soup(['script', 'style']):
    script_or_style.decompose()

# 5. Extracting the Text

**text = soup.get_text()**

*   **Explanation:** Now that we’ve cleaned up the website content, we use get_text() to pull out just the text. This is like copying all the words from a webpage into a blank document, ignoring everything else like images, buttons, and links.
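To see the effect of steps 4 and 5 on a small input, here is a rough standard-library sketch. It is not BeautifulSoup: it swaps in Python's built-in `html.parser.HTMLParser` and simply ignores everything inside `<script>` and `<style>` while collecting the remaining text. The class name `TextExtractor` and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text from HTML while skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # Greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

# Hypothetical sample page, standing in for a real response
html = """<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><p>Advanced SEO</p><script>var tracking = true;</script><p>Services</p></body></html>"""

parser = TextExtractor()
parser.feed(html)
parser.close()

# Join the pieces and collapse whitespace so the result is easy to read
text = " ".join(" ".join(parser.parts).split())
print(text)
```

Note that the page title is kept, just as the title appears at the start of the text that get_text() returns in this notebook.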







In [None]:
# Extract the text content from the cleaned HTML
text = soup.get_text()
text

'\n\n  \n\n\n\n\n\n\n\nTHATWARE® - AI Powered SEO & Best Advanced SEO Agency\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSERVICES\n\nAdvanced SEO\nADVANCED DIGITAL MARKETING\nADVANCED LINK BUILDING\nFULLY MANAGED SEO\nBUSINESS INTELLIGENCE\nPAID MARKETING\nGoogle penalty recovery\nConversion Rate Optimization\nSOCIAL MEDIA MARKETING\nInformation Retrieval & NLP Services\nMarket Research Services\nCompetitor Keyword Analysis and Research Services\nContent Writing Services\nContent Proofreading Services\nWeb Development Services\nGraphic Design Services\nTechnology Consulting Services\nAWS Managed Services\nWebsite Maintenance Services\nBug and Software Testing Services\nCustom Software Development Services (SAAS)\nMobile and Web App Services\nUX Design Services\nUI Services\nChatbot Services\nWebsite Design Services\n\n\nWhy AI ?\nOUR COMPANY\n\nABOUT\nHOW IT WORKS\nHOW WE MANAGE\nCAREER\nABOUT AUT

# 6. Cleaning Up the Text

**text = re.sub(r'\s+', ' ', text).strip()**

*   **Explanation:** After getting the text, there might be a lot of extra spaces or new lines (like when you press Enter in a document). This line cleans all that up:

  *   **re.sub(r'\s+', ' ', text):** This part replaces all the extra spaces and new lines with a single space, making the text look neat.

  *   **.strip():** This removes any extra spaces from the beginning or end of the text, ensuring the text starts and ends cleanly.

# Step-by-Step Explanation
**1. Understanding \s+**

*   **Explanation:** **\s** is a special pattern in regular expressions (regex) that matches any whitespace character. This includes:
   *   **Spaces**
   *   **Tabs** (like when you press the "Tab" key)
   *   **Newline characters** (like when you press "Enter")

*   The **+** sign means **"one or more"** of these characters. So **\s+** will match any sequence of one or more **whitespace characters.**

**Example:**

*   **Input:** "This   is   a    sentence."
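*   **Matched by \s+:**  It matches each run of spaces between the words: the run between "This" and "is", the run between "is" and "a", and the run between "a" and "sentence." Each run counts as a single match, no matter how many spaces it contains.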

**2. Using re.sub() to Replace**

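**Explanation:** **re.sub(pattern, replacement, text)** finds every match of the pattern in the text and swaps it for the replacement. Here the pattern is **\s+** and the replacement is a single space.

*   **Input:** "This   is a   sentence."
*   **Output:** "This is a sentence."

*   **What Happened:** Each run of multiple spaces between words was replaced by a single space.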

**3. Using .strip() to Remove Leading and Trailing Spaces**

*   **Input Before strip():**  "     This is a sentence.   "
*   **Output After strip():**  "This is a sentence."
*   **What Happened:** The extra spaces at the start and end of the string were removed.

# Putting It All Together

**Original Text**


**text = "  This   is  a     sample   text.   "**

*   **What the text looks like:** There are multiple spaces between words and extra spaces at the beginning and end.

**Applying the re.sub() Part**


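*   **text = re.sub(r'\s+', ' ', text)**
*   **After re.sub:** The text becomes " This is a sample text. "
*   **What happened:** Every run of spaces, including the runs at the beginning and end, was replaced by a single space.

**Applying the .strip() Part**

*   **text = text.strip()**
*   **After strip():** The text becomes "This is a sample text."
*   **What happened:** The single spaces left at the beginning and end were removed.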
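The walkthrough above can be run as a short script. This sketch uses only the standard `re` module; the second input string, with tabs and newlines, is an invented example to show that `\s+` covers more than plain spaces.

```python
import re

raw = "  This   is  a     sample   text.   "

# Step 1: collapse every run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", raw)
print(repr(collapsed))

# Step 2: drop the single spaces left at the start and end
cleaned = collapsed.strip()
print(repr(cleaned))

# Tabs and newlines are whitespace too, so they are collapsed the same way
scraped = "SEO\n\n\tServices \n Pricing"
print(repr(re.sub(r"\s+", " ", scraped).strip()))
```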




















































In [None]:
# Clean the text by removing extra spaces, tabs, and newlines
text = re.sub(r'\s+', ' ', text).strip()
text

'THATWARE® - AI Powered SEO & Best Advanced SEO Agency SERVICES Advanced SEO ADVANCED DIGITAL MARKETING ADVANCED LINK BUILDING FULLY MANAGED SEO BUSINESS INTELLIGENCE PAID MARKETING Google penalty recovery Conversion Rate Optimization SOCIAL MEDIA MARKETING Information Retrieval & NLP Services Market Research Services Competitor Keyword Analysis and Research Services Content Writing Services Content Proofreading Services Web Development Services Graphic Design Services Technology Consulting Services AWS Managed Services Website Maintenance Services Bug and Software Testing Services Custom Software Development Services (SAAS) Mobile and Web App Services UX Design Services UI Services Chatbot Services Website Design Services Why AI ? OUR COMPANY ABOUT HOW IT WORKS HOW WE MANAGE CAREER ABOUT AUTHOR SEO CASE STUDIES AI CASE STUDIES AI SEO BLUEPRINT AI-SEO Video Become Our Reseller Our Corporate Deck SEO FAQ’s for Clients FAQ BLOGS CONTACT Pricing 360 Degree SEO Package Enterprise SEO Prici

# 1. Breaking the Text into Sentences

**sentences = sent_tokenize(text)**

*   **Explanation:** The first preprocessing step is to break the text into sentences using sent_tokenize. This means that if you give it a paragraph, it will split it into separate sentences.

**Example:**

*   **Input Text: "Hello world. This is an example sentence."**
*   **Output After sent_tokenize:** ["Hello world.", "This is an example sentence."]

*   **What Happened:** The paragraph was split into two sentences.

# 2. Breaking Sentences into Words and Converting to Lowercase

**tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]**

**Explanation:** This line does two important things:

*   **sentence.lower():** Converts each sentence to lowercase. This makes sure that words like **"Hello"** and **"hello"** are treated as the same word, ignoring case differences.

*   **word_tokenize(sentence.lower()):** Breaks each sentence into individual words. This means that each sentence is split into the words that make it up.

**Example:**

*   **Input Sentences:** ["Hello world.", "This is an example sentence."]

*   **Output After word_tokenize:** [["hello", "world", "."], ["this", "is", "an", "example", "sentence", "."]]

*   **What Happened:** Each sentence was split into words, and everything was converted to lowercase.
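The shape of the data after these two steps can be reproduced without NLTK. The sketch below is a deliberately simplified stand-in that uses regular expressions; it is not the Punkt algorithm behind sent_tokenize and will mis-split text containing abbreviations such as "Dr." or "e.g.". The function names and the three-word stop list are invented for this example.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough approximation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Hello world. This is an example sentence."

sentences = simple_sent_tokenize(text)
tokenized_sentences = [simple_word_tokenize(s.lower()) for s in sentences]
print(tokenized_sentences)

# The notebook's code cell also drops stop words; a tiny hand-made list stands in for NLTK's
toy_stop_words = {"this", "is", "an"}
filtered = [[w for w in sent if w not in toy_stop_words] for sent in tokenized_sentences]
print(filtered)
```

The result is a list of sentences, where each sentence is a list of lowercase tokens, which is the input format Word2Vec expects.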



















In [None]:
from nltk.corpus import stopwords  # Provides the list of English stop words

# Download the stop word list (needed once per environment)
nltk.download('stopwords')

# Get the list of stopwords from NLTK
stop_words = set(stopwords.words('english'))

# Split the text into sentences
sentences = sent_tokenize(text)
# Split each sentence into words, convert them to lowercase, and remove stopwords
tokenized_sentences = [
    [word for word in word_tokenize(sentence.lower()) if word not in stop_words]
    for sentence in sentences
]

tokenized_sentences

[['thatware®',
  '-',
  'ai',
  'powered',
  'seo',
  '&',
  'best',
  'advanced',
  'seo',
  'agency',
  'services',
  'advanced',
  'seo',
  'advanced',
  'digital',
  'marketing',
  'advanced',
  'link',
  'building',
  'fully',
  'managed',
  'seo',
  'business',
  'intelligence',
  'paid',
  'marketing',
  'google',
  'penalty',
  'recovery',
  'conversion',
  'rate',
  'optimization',
  'social',
  'media',
  'marketing',
  'information',
  'retrieval',
  '&',
  'nlp',
  'services',
  'market',
  'research',
  'services',
  'competitor',
  'keyword',
  'analysis',
  'research',
  'services',
  'content',
  'writing',
  'services',
  'content',
  'proofreading',
  'services',
  'web',
  'development',
  'services',
  'graphic',
  'design',
  'services',
  'technology',
  'consulting',
  'services',
  'aws',
  'managed',
  'services',
  'website',
  'maintenance',
  'services',
  'bug',
  'software',
  'testing',
  'services',
  'custom',
  'software',
  'development',
  'services'

# 1. Importing the Word2Vec Library

**from gensim.models import Word2Vec**

*   **Explanation:** The first thing we do is import the Word2Vec tool from the gensim library. This tool helps us create a model that understands the relationships between words based on how they are used in sentences. Think of it as teaching the computer how to recognize that words like **"dog"** and **"puppy"** are related because they often appear in similar situations.

# 2. Training the Word2Vec Model

**model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)**

**Explanation:** This line is where the actual training happens. We create a Word2Vec model by giving it our tokenized sentences and setting some specific parameters. Let’s break down what each part means:

*   **sentences=tokenized_sentences:** This tells the model to learn from the sentences we provide. Each sentence is a list of words, and the model looks at how these words appear together.


*   **vector_size=100:** Words are turned into vectors, which are lists of numbers that represent the meaning of a word. The vector_size=100 means that each word will be represented by a list of 100 numbers. A higher number can capture more details, but it also requires more computer power.


*   **window=5:** This defines the "window" size, or how many words before and after the target word the model should look at. For example, if the window size is 5, the model looks at up to 5 words before and 5 words after the target word, within the same sentence, to understand its context. A larger window means the model considers a broader context.


*   **min_count=1:** This tells the model to include all words that appear at least once in the text. If we set this number higher, the model would ignore words that appear only a few times, focusing instead on more common words.


*   **workers=4:** This tells gensim to use 4 worker threads for training, which can make training faster on a machine with multiple CPU cores.
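The effect of min_count is easy to preview by counting words directly. The sketch below is a plain-Python illustration rather than gensim's own vocabulary code; the toy sentences are modeled on the scraped tokens, and the helper name `vocabulary` is an assumption made for this example.

```python
from collections import Counter

# Hypothetical toy corpus, modeled on the tokens scraped earlier
toy_sentences = [
    ["advanced", "seo", "services"],
    ["advanced", "digital", "marketing"],
    ["seo", "content", "writing", "services"],
]

def vocabulary(sentences, min_count):
    """Return the words whose total frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

print(sorted(vocabulary(toy_sentences, min_count=1)))  # Every word is kept
print(sorted(vocabulary(toy_sentences, min_count=2)))  # Words seen only once are dropped
```

On a single web page many useful keywords appear only once, which is one reason this notebook keeps min_count=1.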

**Example:**

**Suppose we have the following tokenized sentences:**


    tokenized_sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "barked", "at", "the", "cat"],
        ["the", "cat", "chased", "the", "mouse"]
    ]

*   The **Word2Vec** model will learn that **"cat"** is often near words like **"sat"**, **"mat"**, **"dog"**, and **"mouse"**. It will create a vector (a list of numbers) for **"cat"** that captures these relationships.
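The claim about which words sit near "cat" can be verified by counting the neighbors that fall inside a window of 5. This is only a counting sketch with an invented helper name, `neighbors`; Word2Vec uses such windows as training signal rather than storing the counts.

```python
from collections import Counter

tokenized_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]

def neighbors(sentences, target, window):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word != target:
                continue
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            # Words before the target plus words after it, inside the window
            counts.update(sentence[start:i] + sentence[i + 1:end])
    return counts

cat_neighbors = neighbors(tokenized_sentences, "cat", window=5)
print(cat_neighbors.most_common())
```

The most frequent neighbor is "the", which carries little meaning. That is the motivation for removing stop words before training.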

**Explanation:** This trained model can now be used to find out how similar different words are, or to understand the context in which words are used.

**Example:**


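*   Once the model is trained, you can ask it questions like "What words are similar to **'cat'**?" and it might answer with **"dog"** or **"mouse"** based on the context it learned from the sentences. The model can only suggest words that appeared in its training text, and with a corpus this small the similarity scores are noisy.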






















In [None]:
# Create and train the Word2Vec model with specific parameters:
    # vector_size: The dimensionality of the word vectors
    # window: The maximum distance between the current and predicted word within a sentence
    # min_count: Ignores all words with total frequency lower than this
    # workers: The number of worker threads to train the model (uses multiple cores)
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
model

<gensim.models.word2vec.Word2Vec at 0x787348d62830>

In [None]:
import requests  # Library to send HTTP requests to a website and get its content
from bs4 import BeautifulSoup  # Library to parse and extract information from HTML content
from gensim.models import Word2Vec  # Library to create and train the Word2Vec model
import re  # Library to perform regular expression operations for text cleaning
import nltk  # Natural Language Toolkit library for text processing
from nltk.corpus import stopwords  # Function to get a list of stop words in English
from nltk.tokenize import word_tokenize, sent_tokenize  # Functions to split text into sentences and words

# Download necessary data for sentence and word tokenization
nltk.download('punkt')
nltk.download('stopwords')

# List of irrelevant words to remove
irrelevant_words = ['bally', 'surrounding', '2024', '(', ')', ':', ':-', ',', 'example', 'irrelevant', 'word1', 'word2', 'word3']  # Add more words as needed

def scrape_website(url):
    """
    This function sends a request to the website and extracts the main text content.
    It removes unnecessary HTML tags like scripts and styles, leaving only the meaningful text.
    """
    try:
        # Send a GET request to the website to retrieve its content
        response = requests.get(url, timeout=10)  # Give up after 10 seconds instead of waiting indefinitely

        # Check if the request was successful (status code 200 means success)
        if response.status_code != 200:
            print(f"Failed to retrieve the content from {url}")
            return ""

        # Parse the HTML content of the website using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove <script> and <style> tags to clean up the text content
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        # Extract the text content from the cleaned HTML
        text = soup.get_text()

        # Clean the text by removing extra spaces, tabs, and newlines
        text = re.sub(r'\s+', ' ', text).strip()

        return text  # Return the cleaned text content
    except Exception as e:
        # Print an error message if something goes wrong
        print(f"An error occurred while scraping {url}: {e}")
        return ""

def preprocess_text(text, irrelevant_words):
    """
    This function breaks down the text into sentences and then into words.
    It removes stop words, irrelevant words, and integers to keep only meaningful words for analysis.
    It returns a list of sentences, where each sentence is a list of words.
    This step is crucial for training the Word2Vec model.
    """
    # Get the list of stopwords from NLTK
    stop_words = set(stopwords.words('english'))

    # Combine stop words and irrelevant words
    all_words_to_remove = stop_words.union(set(irrelevant_words))

    # Split the text into sentences
    sentences = sent_tokenize(text)

    # Split each sentence into words, convert them to lowercase, and remove stopwords, irrelevant words, and integers
    tokenized_sentences = [
        [word for word in word_tokenize(sentence.lower())
         if word not in all_words_to_remove and not word.isdigit()]
        for sentence in sentences
    ]

    return tokenized_sentences  # Return the list of tokenized sentences

def train_word2vec_model(tokenized_sentences):
    """
    This function trains a Word2Vec model using the tokenized sentences.
    The model learns word embeddings based on the context in which words appear.
    """
    model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
    return model

def analyze_website_content(url, keywords):
    """
    This function scrapes content, preprocesses it, trains a Word2Vec model,
    and analyzes the given keywords by displaying their vector representations
    and the most similar words.
    """
    text = scrape_website(url)
    if not text:
        print("No content found on the website.")
        return

    # Preprocess the text with stop words, irrelevant words, and integers removed
    tokenized_sentences = preprocess_text(text, irrelevant_words)

    # Train the Word2Vec model using the tokenized sentences
    model = train_word2vec_model(tokenized_sentences)

    for word in keywords:
        if word in model.wv:
            print(f"Vector for the word '{word}':\n{model.wv[word]}\n")
            similar_words = model.wv.most_similar(word, topn=10)
            print(f"Words most similar to '{word}':\n")
            for similar_word, similarity in similar_words:
                print(f"{similar_word}: {similarity:.4f}")
        else:
            print(f"The word '{word}' is not in the vocabulary.\n")

# Example URL of the website to be analyzed
url = 'https://thatware.co/'

# Example keywords to analyze
keywords = ['seo', 'services', 'marketing', 'development']

# Run the analysis on the specified website
analyze_website_content(url, keywords)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Vector for the word 'seo':
[-1.78686879e-03  1.27924746e-03  5.19233290e-03  8.99612997e-03
 -9.19000432e-03 -9.17629991e-03  7.51689123e-03  1.22698583e-02
 -6.20630383e-03 -4.67381906e-03  6.54071756e-03 -3.36894067e-03
 -4.76285303e-03  6.43503433e-03 -4.65196418e-03 -2.46845745e-03
  3.44370888e-03  1.74699439e-04 -8.33628792e-03 -1.24707511e-02
  8.03449936e-03  6.08580979e-03  6.84398133e-03  6.18427177e-04
  6.05642330e-03 -2.94890860e-03 -2.57145637e-03  5.27323503e-03
 -8.33064225e-03 -3.82944569e-03 -5.56121068e-03 -6.07134483e-04
  1.06084896e-02 -7.75583880e-03 -3.37055838e-03 -7.22634024e-04
  8.58526677e-03 -7.04919035e-03 -1.46026840e-03 -7.42772175e-03
 -1.01934317e-02  4.55531711e-03 -8.79405811e-03 -4.74975724e-03
  1.11040357e-03 -9.04066372e-04 -8.42820480e-03  1.01898331e-02
  6.58653909e-03  1.01182284e-02 -7.88378250e-03  3.40578193e-03
 -4.30685747e-03  2.48141092e-04  7.60090491e-03 -3.61191621e-03
  5.62928896e-03 -6.97219465e-03 -5.15582692e-03  9.66196880e-0

In [None]:
import requests  # Library to send HTTP requests to a website and get its content
from bs4 import BeautifulSoup  # Library to parse and extract information from HTML content
from gensim.models import Word2Vec  # Library to create and train the Word2Vec model
import re  # Library to perform regular expression operations for text cleaning
import nltk  # Natural Language Toolkit library for text processing
from nltk.corpus import stopwords  # Function to get a list of stop words in English
from nltk.tokenize import word_tokenize, sent_tokenize  # Functions to split text into sentences and words

# Download necessary data for sentence and word tokenization
nltk.download('punkt')
nltk.download('stopwords')

# List of irrelevant words to remove
irrelevant_words = ['bally', 'surrounding', '2024', '(', ')', ':', ':-', ',', 'example', 'irrelevant', 'word1', 'word2', 'word3']  # Add more words as needed

def scrape_website(url):
    """
    This function sends a request to the website and extracts the main text content.
    It removes unnecessary HTML tags like scripts and styles, leaving only the meaningful text.
    """
    try:
        # Send a GET request to the website to retrieve its content
        response = requests.get(url, timeout=10)  # Give up after 10 seconds instead of waiting indefinitely

        # Check if the request was successful (status code 200 means success)
        if response.status_code != 200:
            print(f"Failed to retrieve the content from {url}")
            return ""

        # Parse the HTML content of the website using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove <script> and <style> tags to clean up the text content
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        # Extract the text content from the cleaned HTML
        text = soup.get_text()

        # Clean the text by removing extra spaces, tabs, and newlines
        text = re.sub(r'\s+', ' ', text).strip()

        return text  # Return the cleaned text content
    except Exception as e:
        # Print an error message if something goes wrong
        print(f"An error occurred while scraping {url}: {e}")
        return ""

def preprocess_text(text, irrelevant_words):
    """
    This function breaks down the text into sentences and then into words.
    It removes stop words, irrelevant words, and integers to keep only meaningful words for analysis.
    It returns a list of sentences, where each sentence is a list of words.
    This step is crucial for training the Word2Vec model.
    """
    # Get the list of stopwords from NLTK
    stop_words = set(stopwords.words('english'))

    # Combine stop words and irrelevant words
    all_words_to_remove = stop_words.union(set(irrelevant_words))

    # Split the text into sentences
    sentences = sent_tokenize(text)

    # Split each sentence into words, convert them to lowercase, and remove stopwords, irrelevant words, and integers
    tokenized_sentences = [
        [word for word in word_tokenize(sentence.lower())
         if word not in all_words_to_remove and not word.isdigit()]
        for sentence in sentences
    ]

    return tokenized_sentences  # Return the list of tokenized sentences

def train_word2vec_model(tokenized_sentences):
    """
    This function trains a Word2Vec model using the tokenized sentences.
    The model learns word embeddings based on the context in which words appear.
    """
    model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
    return model

def analyze_website_content(websites, keywords):
    """
    This function loops through a list of websites, scrapes content from each, preprocesses it,
    trains a Word2Vec model, and analyzes the given keywords by displaying their vector representations
    and the most similar words.
    """
    for url in websites:
        print(f"Analyzing content from: {url}\n")
        text = scrape_website(url)
        if not text:
            print("No content found on the website.")
            continue

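        # Preprocess the text with stop words, irrelevant words, and integers removed.
        # Note: `irrelevant_words` is not a parameter of this function; it is assumed
        # to be defined at module level earlier in the notebook.
        tokenized_sentences = preprocess_text(text, irrelevant_words)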

        # Train the Word2Vec model using the tokenized sentences
        model = train_word2vec_model(tokenized_sentences)

        for word in keywords:
            if word in model.wv:
                print(f"Vector for the word '{word}':\n{model.wv[word]}\n")
                similar_words = model.wv.most_similar(word, topn=10)
                print(f"Words most similar to '{word}':\n")
                for similar_word, similarity in similar_words:
                    print(f"{similar_word}: {similarity:.4f}")
            else:
                print(f"The word '{word}' is not in the vocabulary.\n")
        print("="*50 + "\n")

# Example URLs for Testing
websites = [
    'https://thatware.co/',                                  # ThatWare homepage
    'https://www.incrementors.com/',                         # Incrementors homepage
    'https://www.techwebers.com/',                           # Techwebers homepage
    'https://www.seotechexperts.com/seo-agency-india.html'   # SEO Tech Experts: SEO agency India page
]

# Example keywords to analyze
keywords = ['seo', 'services', 'marketing', 'development']

# Run the analysis on the specified websites
analyze_website_content(websites, keywords)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Analyzing content from: https://thatware.co/

Vector for the word 'seo':
[-1.78686879e-03  1.27924746e-03  5.19233290e-03  8.99612997e-03
 -9.19000432e-03 -9.17629991e-03  7.51689123e-03  1.22698583e-02
 -6.20630383e-03 -4.67381906e-03  6.54071756e-03 -3.36894067e-03
 -4.76285303e-03  6.43503433e-03 -4.65196418e-03 -2.46845745e-03
  3.44370888e-03  1.74699439e-04 -8.33628792e-03 -1.24707511e-02
  8.03449936e-03  6.08580979e-03  6.84398133e-03  6.18427177e-04
  6.05642330e-03 -2.94890860e-03 -2.57145637e-03  5.27323503e-03
 -8.33064225e-03 -3.82944569e-03 -5.56121068e-03 -6.07134483e-04
  1.06084896e-02 -7.75583880e-03 -3.37055838e-03 -7.22634024e-04
  8.58526677e-03 -7.04919035e-03 -1.46026840e-03 -7.42772175e-03
 -1.01934317e-02  4.55531711e-03 -8.79405811e-03 -4.74975724e-03
  1.11040357e-03 -9.04066372e-04 -8.42820480e-03  1.01898331e-02
  6.58653909e-03  1.01182284e-02 -7.88378250e-03  3.40578193e-03
 -4.30685747e-03  2.48141092e-04  7.60090491e-03 -3.61191621e-03
  5.62928896e-03 


# Understanding Word2Vec and Its Benefits for Website Owners

Word2Vec is a powerful tool that helps us understand the relationships between words based on how they appear together in a text. It takes large amounts of text (like the content of a website) and learns how words are related by looking at the contexts in which they are used. The result is a set of word "vectors": each vector is a list of numbers that represents a word's meaning based on its context.

# Use Cases of Word2Vec

**1.   Content Optimization:**

*   **What It Does:** Word2Vec helps website owners see how different words on their site are connected. For example, it can show that "SEO" and "marketing" are often used together, meaning they are related topics on your site.

*   **Benefit:** By understanding these connections, website owners can focus on key terms that are important to their business. They can use this information to make sure that important words are used in the right places and in the right context, making the content more relevant to what users are searching for.

**2.   Search Engine Optimization (SEO):**

*   **What It Does:** Word2Vec helps identify which keywords are related and how they should be used together. For example, if you know that "SEO" and "services" are closely related on your site, you can optimize your content to make sure these words appear together in important places like headings, titles, and meta descriptions.

*   **Benefit:** By optimizing your content this way, your site becomes more relevant to search engines like Google. This can help improve your website’s ranking on search results pages, making it more likely that people will find your site when they search for related topics.

**3.   Increasing Website Traffic:**

*   **What It Does:** When your site is better optimized for search engines, it appears higher in search results. More people will find your site, which can lead to more visitors.

*   **Benefit:** More traffic means more potential customers or readers, which can lead to more sales, more ad revenue, or more influence in your niche.














# How to Use the Output of Word2Vec

Now, let’s look at the output you received and understand how you can use it to improve your website:

**Example of Output Interpretation**

**1.   Understanding Vectors:**

*   **Output Example:** You received a vector for the word "SEO". It looks something like this:
[-0.00217646, 0.00206102, 0.00534912, ...]

*   **Explanation:** This is a mathematical representation of the word "SEO" based on its context in your website's content. While the numbers themselves might not make sense to a human, they are very useful for the computer in understanding what "SEO" means in relation to other words on your site.

**2.   Similar Words:**

*   **Output Example:** You also received a list of words similar to "SEO" like "link-building", "marketing", etc.

**Words most similar to 'seo':**

**link-building: 0.3009**

**marketing: 0.2522**

*   **Explanation:** This list tells you that on your site, the word "SEO" is used in contexts similar to those of "link-building" and "marketing". This suggests these words are related and could be important for your content.














# What Should You Do Next?

**1.   Optimize Content Based on Keywords:**

*   **Action:** Look at the keywords that are similar to your important terms like "SEO" and "services". Make sure these related words are included naturally in your content. For example, if "SEO" and "marketing" are related, ensure that your content about SEO also mentions marketing strategies.

*   **Benefit:** This makes your content more comprehensive and relevant, which is something search engines like Google value. It can help improve your ranking.

**2.   Focus on High-Value Keywords:**

*   **Action:** Identify the most important keywords for your business (like "SEO", "services", "marketing") and ensure they are prominently featured in your content. Use them in headings, meta descriptions, and within the body text.

*   **Benefit:** By doing this, you tell search engines that these are key topics on your site, which can help rank your site higher when people search for these terms.

**3.   Update and Refine Content:**

*   **Action:** If some important words are missing from your content, add them where appropriate. For example, if "SEO" is related to "link-building", but your content doesn’t mention link-building, consider adding a section about it.


*   **Benefit:** This makes your content more detailed and useful, which can lead to better user engagement and higher rankings.

**4.   Monitor and Adjust:**

*   **Action:** After optimizing your content, monitor your website’s performance using tools like Google Analytics. See if your changes lead to more traffic or better rankings.

*   **Benefit:** This helps you understand what’s working and what might need further adjustment, ensuring that your site continues to improve over time.
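
The "Update and Refine Content" step can be partly automated. Below is a minimal sketch, assuming the related words come from somewhere other than the page being checked (for example, a model trained on a competitor's site or a keyword research list), since a model trained only on your own page can only return words that page already contains. The function name `missing_related_terms` and the sample text are hypothetical.

```python
import re

def missing_related_terms(page_text, related_words):
    """Return the related words that never appear in the page text."""
    # Lowercase the text and keep hyphenated words such as "link-building" intact
    present = set(re.findall(r"[a-z0-9-]+", page_text.lower()))
    return [word for word in related_words if word.lower() not in present]

# Hypothetical page text and related-word list, for illustration only
page_text = "Our SEO services include keyword research and content marketing."
related_words = ["link-building", "marketing", "services"]

print(missing_related_terms(page_text, related_words))
```

Any word this returns is a candidate for a new paragraph or section, provided it is genuinely relevant to your business.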


















# How to Identify High-Value Keywords Using Word2Vec
* When you get the output from Word2Vec, it gives you a list of words that are closely related to your chosen keywords, along with how similar they are to each other. This information is valuable because it helps you understand which words are important on your website and how they relate to each other. To determine which keywords are high-value, follow these steps:

**Step 1: Look at the Similarity Scores**

*   **What You See:** When you run the Word2Vec analysis, you get a list of words that are similar to your chosen keywords. Each similar word comes with a similarity score (the cosine similarity between the two word vectors), which can range from **-1 to 1**; the top results for a keyword usually fall between **0 and 1**. The closer this number is to **1**, the more related the words are.

*   **For the keyword "SEO", you might see:**
  * Words most similar to 'seo':

  * link-building: 0.3009

  * marketing: 0.2522

  * principles: 0.2401

* Here, **"link-building"** has the highest similarity score **(0.3009)**, meaning it's the most related word to **"SEO"** in your content.
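
To make the similarity score less abstract, here is a minimal sketch of cosine similarity, the measure gensim's `most_similar` reports, written in plain Python. The three-number vectors are made up for illustration; real Word2Vec vectors in this project have 100 numbers each.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only
seo = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # points the same way as `seo`
opposite = [-1.0, -2.0, 0.0]       # points the opposite way
unrelated = [0.0, 0.0, 3.0]        # at a right angle to `seo`

print(cosine_similarity(seo, same_direction))  # close to 1
print(cosine_similarity(seo, opposite))        # close to -1
print(cosine_similarity(seo, unrelated))       # 0
```

Vectors pointing the same way score near 1, unrelated vectors score near 0, and negative scores are possible when vectors point in opposite directions.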

**Step 2: Identify High-Value Keywords**
*   **What to Do:** High-value keywords are the ones that are most related to your primary keywords **(like "SEO" or "marketing")** and are essential to **your business.** These are words that you want to be closely associated with your **brand or services.**

**How to Decide:**

* **High Similarity Score:** Look for words with higher similarity scores. These are more likely to be important because they are used in contexts similar to your primary keyword.

* **Business Relevance:** Consider how relevant the word is to your business. For example, if you run a digital marketing agency, words like "link-building" and "marketing" would be high-value because they are directly related to your services.

* **Frequency in Content:** Think about how often these words appear on your website. Words that are both frequent and have high similarity scores are likely high-value.

**Example :**

* Let’s say you’re focusing on the keyword **"SEO".** If **"link-building"** and **"marketing"** both have high similarity scores and are relevant to your business, **you should treat them as high-value keywords.**
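
The three criteria above (similarity score, business relevance, frequency) can be combined in a short helper. This is only a sketch under stated assumptions: `rank_candidates`, the `business_terms` set, and the `min_similarity` threshold of 0.2 are hypothetical choices, not part of the notebook's code, and the right threshold depends on your data.

```python
from collections import Counter

def rank_candidates(similar_words, tokens, business_terms, min_similarity=0.2):
    """
    Keep related words that pass the similarity threshold and are relevant
    to the business, and report how often each one appears in the content.
    """
    counts = Counter(tokens)
    ranked = []
    for word, similarity in similar_words:
        if similarity < min_similarity:
            continue  # too weakly related to the primary keyword
        if word not in business_terms:
            continue  # not relevant to the business
        ranked.append((word, similarity, counts[word]))
    # Highest similarity first; frequency breaks ties
    ranked.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return ranked

# Scores taken from the example output above; tokens and business terms are made up
similar_words = [("link-building", 0.3009), ("marketing", 0.2522), ("principles", 0.2401)]
tokens = ["seo", "marketing", "link-building", "marketing", "principles", "seo"]
business_terms = {"link-building", "marketing", "services"}

for word, similarity, frequency in rank_candidates(similar_words, tokens, business_terms):
    print(f"{word}: similarity={similarity:.4f}, frequency={frequency}")
```

In practice, `similar_words` would be the list returned by `model.wv.most_similar(...)` and `tokens` would be the flattened output of `preprocess_text`.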

**Step 3: Use High-Value Keywords in Your Content**

**What to Do:** Once you've identified your high-value keywords, make sure they are prominently featured on your website. This means using them in key places like:

* **Headings:** Titles of your articles or sections should include these keywords.

* **Meta Descriptions:** The short descriptions that appear in search results should mention these keywords.

*  **Body Text:** Make sure these keywords are used naturally throughout your content.

**Example:**

*  If **"SEO" and "link-building"** are high-value keywords for your website, you might have a heading like **"Effective SEO and Link-Building Strategies".** You would also mention **"SEO"** and **"link-building"** multiple times throughout your content, especially in key sections.
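
To check whether your high-value keywords actually appear in those key places, you could scan a page's HTML. The sketch below uses Python's built-in `html.parser` so it runs without extra installs; the class `PlacementParser`, the function `keyword_placements`, and the sample HTML are hypothetical. It uses simple substring matching, so a short keyword can also match inside a longer word.

```python
from html.parser import HTMLParser

class PlacementParser(HTMLParser):
    """Collects the text of <title>, <h1>-<h3>, and the meta description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.meta_description = ""
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
            if tag != "title":
                self.headings.append("")  # start a new heading entry
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in ("h1", "h2", "h3"):
            self.headings[-1] += data

def keyword_placements(html, keywords):
    """Report, per keyword, whether it appears in the title, headings, or meta description."""
    parser = PlacementParser()
    parser.feed(html)
    title = parser.title.lower()
    headings = " ".join(parser.headings).lower()
    meta = parser.meta_description.lower()
    return {
        kw: {
            "title": kw.lower() in title,
            "headings": kw.lower() in headings,
            "meta_description": kw.lower() in meta,
        }
        for kw in keywords
    }

# Hypothetical page, for illustration only
sample_html = """
<html><head>
<title>Effective SEO and Link-Building Strategies</title>
<meta name="description" content="SEO services for growing businesses.">
</head><body>
<h1>Effective SEO and Link-Building Strategies</h1>
<h2>Why link-building matters</h2>
</body></html>
"""

for keyword, found in keyword_placements(sample_html, ["SEO", "link-building"]).items():
    print(keyword, found)
```

In this sample, "link-building" is present in the title and headings but missing from the meta description, which is the kind of gap the check is meant to surface.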



















  









# 1. Why Are Some Words Like "often" and "also" Considered Similar to "SEO"?
*   **Reason:** Word2Vec learns word relationships from the contexts in which words appear, not from their dictionary meaning. If "often" and "also" show up as similar to "SEO", it means that in your website's content these words appear in contexts similar to those of "SEO". It can also indicate that the text is not very focused, that noise (irrelevant words) is being picked up, or that there is too little text for the model to learn reliable relationships, which is likely when it is trained on a single page.

*   **Example:** Imagine that on your website "SEO" frequently appears in sentences that also contain "often" and "also", even though the words are not meaningfully related. Word2Vec will still learn that they are related, simply because they frequently appear near each other.


# Steps to Take When You Get Such Outputs

**Step 1: Review Your Content**

* **Action:** Go back to the content on your website and see where words like "often" and "also" are being used in relation to "SEO". Are these words really related in a meaningful way, or are they just appearing together by coincidence? Do the same for "development" and "every".

* **Example:** If "often" and "also" appear in sentences about the "SEO strategies surrounding content marketing", consider rephrasing to make the connection clearer. If "halt" or "every" seems out of place, consider editing your content to make it more focused.
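
One way to carry out this review is to pull up every sentence in which the keyword and the unexpected word appear together. The sketch below is an assumption-laden helper, not part of the notebook: `sentences_with_both` is a hypothetical name, and it splits sentences with a simple regular expression rather than NLTK's `sent_tokenize` so that it runs on its own.

```python
import re

def sentences_with_both(text, keyword, other):
    """Return the sentences in which both words appear (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    matches = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        if keyword.lower() in words and other.lower() in words:
            matches.append(sentence)
    return matches

# Hypothetical website text, for illustration only
sample_text = (
    "SEO is often treated as a one-time task. "
    "We also offer SEO audits. "
    "Our team builds websites."
)

for sentence in sentences_with_both(sample_text, "seo", "often"):
    print(sentence)
```

Reading the matching sentences side by side makes it easier to judge whether the connection is meaningful or just a coincidence of phrasing.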

**Step 2: Clean Your Data**

* **Action:** If you find that irrelevant words (like "halt" or "every") are being considered similar to important keywords, it might be useful to clean your data. This could involve removing uncommon or irrelevant words before training the Word2Vec model.

**How to Do This:**

* **Remove Stop Words:** Words like "the", "is", "in" are commonly removed because they don't add much meaning. Similarly, if there are specific words that are irrelevant to your business, you can remove them.

* **Focus Your Content:** Ensure that your content is focused and uses the important keywords in meaningful contexts. This helps Word2Vec learn better connections.
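
A minimal sketch of this cleaning step is shown below. It assumes the input has the same shape as the output of `preprocess_text` (a list of sentences, each a list of words). The helper `clean_tokens`, the `extra_irrelevant` set, and the `min_count=2` cutoff are hypothetical choices for illustration.

```python
from collections import Counter

def clean_tokens(tokenized_sentences, extra_irrelevant, min_count=2):
    """
    Drop tokens that are punctuation-only, listed as irrelevant,
    or seen fewer than `min_count` times across all sentences.
    """
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    cleaned = []
    for sentence in tokenized_sentences:
        kept = [
            word for word in sentence
            if any(ch.isalnum() for ch in word)   # skip tokens like "," or "."
            and word not in extra_irrelevant      # skip filler words
            and counts[word] >= min_count         # skip rare words
        ]
        if kept:  # drop sentences that end up empty
            cleaned.append(kept)
    return cleaned

# Hypothetical tokenized sentences, for illustration only
tokenized_sentences = [
    ["seo", "often", "improves", "marketing", "."],
    ["seo", "also", "supports", "marketing", ","],
    ["halt", "every", "seo", "campaign"],
]
extra_irrelevant = {"often", "also", "halt", "every"}

print(clean_tokens(tokenized_sentences, extra_irrelevant))
```

The cleaned list can then be passed to `train_word2vec_model`. Raising the `min_count` argument of `Word2Vec` itself is another way to ignore rare words during training.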























# General Steps to Improve Website Ranking and Traffic

**1. Content Audit:** Conduct a thorough audit of your website content to ensure that it is focused on relevant, high-value keywords. Remove or rewrite content that is too generic or unrelated to the primary topics of your site.

**2. Keyword Research:** Use tools like Google Keyword Planner or SEMrush to identify high-value keywords that are directly related to your business. Focus on incorporating these keywords naturally into your content.

**3. On-Page SEO:** Ensure that important keywords appear in key places like titles, headings, meta descriptions, and the body of your content. Use related keywords to build context and improve relevance.

**4. Optimize for User Intent:** Align your content with the intent of your target audience. If users are searching for "SEO services", ensure your content answers their questions and provides clear, valuable information.

**5. Monitor and Adjust:** Use analytics tools to monitor how changes impact traffic and rankings. Continuously refine your strategy based on what works best.

**6. Engagement and Linking:** Increase internal linking to connect relevant pages within your website. This can help distribute link equity and keep users engaged longer. Also, build external backlinks from reputable sites to boost your authority.











