
# **Project Name: Reinforcement Learning-Enhanced SEO: Automating Keywords and Backlinks for Growth**

---

### **Purpose of the Project:**
The purpose of this project is to use **Artificial Intelligence (AI)**, specifically a technique called **Reinforcement Learning (RL)**, to make **Search Engine Optimization (SEO)** smarter and more efficient. The main aim is to help websites grow by **automatically** deciding which **keywords** and **backlinks** to use, without requiring a human to make these decisions manually.

#### **What is SEO?**
SEO is the process of improving a website so it appears higher in the results of search engines such as Google. Websites that rank higher get more visitors, and more visitors often mean more business or exposure.

#### **The Problem:**
Traditionally, SEO requires people (like website owners or marketers) to manually pick the right **keywords** (the terms people search for) and create **backlinks** (links from other websites to your own site) to help the website rank higher. This can be time-consuming, and the wrong choices can hurt the website's visibility.

#### **What Does Reinforcement Learning Do?**
**Reinforcement Learning (RL)** is a type of AI that **learns by trying things out** and improving its actions based on the results. Just like how a person learns from experience, RL learns by doing and gets better at making decisions over time.

In this project, **Reinforcement Learning** is used to:
1. **Choose the best keywords** to target based on current website traffic data.
2. **Select the most effective backlinks** to use for improving the website's ranking.
3. **Continuously improve** these choices as more data (such as website visitors, traffic patterns, and engagement levels) is fed into the system.

#### **How Does This Help Website Owners?**
The key benefit of using Reinforcement Learning in SEO is that it **automates** the process of optimizing a website. Instead of relying only on human judgment or experience, the AI system makes these decisions **automatically** based on the latest available data. This saves time and keeps the website's SEO choices aligned with what the data currently shows, which improves its chances of ranking higher on search engines like Google.

With this project, the website can:
- **Adapt** to changing traffic levels without human intervention.
- **Test different keywords and backlinks** to see what works best, adjusting strategies in real-time.
- **Boost traffic** by making smarter, data-driven decisions about SEO.

#### **Who Can Benefit from This Project?**
- **Website owners** who want to grow their audience but don't have the time or expertise to manage SEO manually.
- **Marketers** looking to automate their SEO tasks and improve efficiency.
- **Businesses** that rely on website traffic for customers or visibility.

---


### **What is Reinforcement Learning for SEO?**

**Reinforcement Learning (RL)** is a type of machine learning where an algorithm learns by interacting with an environment and gets feedback in the form of rewards or penalties. For SEO (Search Engine Optimization), this means using RL to improve website performance by continuously adjusting strategies like content updates, link-building campaigns, or keyword use based on how these actions affect website ranking and traffic in real time.

### Use Cases of Reinforcement Learning for SEO

1. **Content Optimization**: RL can suggest the best ways to update or create new content by analyzing what drives traffic and improves rankings over time. For example, the algorithm can track the performance of different article topics or keywords and suggest adjustments to improve visibility.
  
2. **Link-Building**: The algorithm can decide where and when to build external links (backlinks) or internal links based on past data. It can learn which types of links bring more traffic and improve rankings, optimizing link-building campaigns automatically.

3. **Keyword Targeting**: RL can help identify which keywords to target or focus on by analyzing which ones are driving traffic over time, allowing the system to adjust strategies dynamically.

### Real-Life Implementation of RL for SEO

Imagine a website where you want to improve SEO performance, say a blog. In this case, RL can be implemented to monitor user behavior, track which pages perform well (in terms of traffic, bounce rate, etc.), and adjust various elements of the website automatically. For example, it could:

- Dynamically suggest which blog posts to promote.
- Recommend changes in content format or structure (like adding images, videos, or headings) to improve user engagement.
- Automatically adjust keywords in your content based on real-time trends.

### Use Case in the Context of a Website

In this project, the RL algorithm works with the website's SEO data. Suppose a client wants to improve how their blog ranks on Google. The RL model can monitor how each page performs (for example, how long people stay on a page, which pages lead to conversions, and which pages are being ignored). Based on this, it can dynamically suggest:

- Changes to content (like rewriting certain paragraphs, adding keywords, etc.).
- Which old blog posts to update and how.
- Optimal internal links between different blog posts to increase overall engagement.

### How Does the Code Work?

If you are not a technical person, you do not need to follow every technical detail. Here is the simple version:

1. **Input Data**: The RL model needs data to work. This data can either be in the form of URLs from the website (where the algorithm crawls and processes content) **or** in a structured format like a CSV file that contains SEO metrics (like page views, rankings, bounce rates, etc.).

    - **URLs**: If you provide URLs, the algorithm can automatically fetch the page content, analyze it, and decide how to improve it.
    - **CSV Data**: If you use a CSV file, it can contain columns like keywords, ranking positions, page views, etc., which the model will use to make decisions.

2. **RL Process**: The algorithm learns over time. It checks how changes it suggests (like updating content, adding links, or targeting new keywords) affect your SEO performance and adjusts its strategy accordingly. The process involves:
    - **Action**: The RL model suggests an SEO action (like updating a page or building a link).
    - **Feedback**: The model checks if the action improved or hurt performance (like if the page ranks higher or gets more traffic).
    - **Learning**: Based on the feedback, the algorithm learns and suggests better actions over time.

3. **Output**: The final output would be a set of recommendations or automatic updates to the website that are aimed at improving SEO performance, such as:
    - Which blog post to promote or update.
    - How to optimize content for better user engagement.
    - Which keywords to focus on for improving rankings.
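The action, feedback, and learning loop from step 2 can be sketched with a very small example. This is a minimal illustration, assuming an epsilon-greedy strategy and simulated click-through rates; the notebook does not state which RL algorithm it uses, and the names `run_epsilon_greedy` and `simulated_rates` are hypothetical.

```python
import random


def run_epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an agent that learns which keyword earns the most reward.

    true_rates maps each keyword to a hypothetical success probability.
    The agent never reads these values directly; it only sees rewards.
    """
    rng = random.Random(seed)
    keywords = list(true_rates)
    counts = {k: 0 for k in keywords}        # how often each keyword was tried
    estimates = {k: 0.0 for k in keywords}   # running average reward per keyword

    for _ in range(rounds):
        # Action: explore a random keyword sometimes, otherwise pick the best so far
        if rng.random() < epsilon:
            choice = rng.choice(keywords)
        else:
            choice = max(keywords, key=lambda k: estimates[k])

        # Feedback: a simulated reward of 1 (success) or 0 (no success)
        reward = 1.0 if rng.random() < true_rates[choice] else 0.0

        # Learning: update the running average for the chosen keyword
        counts[choice] += 1
        estimates[choice] += (reward - estimates[choice]) / counts[choice]

    return estimates, counts


# Assumed, made-up success rates used only to drive the simulation
simulated_rates = {"seo services": 0.02, "link building": 0.30, "web design": 0.10}
estimates, counts = run_epsilon_greedy(simulated_rates)
best_keyword = max(estimates, key=estimates.get)
print("Estimated reward per keyword:", estimates)
print("Keyword the agent currently prefers:", best_keyword)
```

In a real setting the simulated reward would be replaced by a measured outcome, such as the change in traffic after targeting a keyword.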

### Data Needed for RL in SEO

The model needs real-time performance data to make decisions. Common data includes:
- **Page Traffic**: How many visitors each page gets.
- **Keyword Performance**: How well specific keywords rank over time.
- **User Engagement**: Metrics like bounce rate, time on site, and conversion rates.
- **Backlink Data**: Information about the number and quality of links pointing to the website.
  
The model uses this data to evaluate its actions (like content updates or link-building) and decide what to try next to maximize SEO performance.
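One way to turn these metrics into feedback for the model is a single reward number per action. The sketch below is only an assumption about how that could look: the function `seo_reward`, its weights, and the sample figures are hypothetical and are not taken from the notebook.

```python
def seo_reward(before, after, w_traffic=1.0, w_bounce=0.5):
    """Hypothetical reward: relative traffic gain minus a penalty for a higher bounce rate."""
    # Relative change in page views (guard against division by zero)
    traffic_gain = (after["views"] - before["views"]) / max(before["views"], 1)
    # Positive when the bounce rate got worse, negative when it improved
    bounce_change = after["bounce_rate"] - before["bounce_rate"]
    return w_traffic * traffic_gain - w_bounce * bounce_change


# Made-up metrics for one page before and after an SEO action
before = {"views": 1000, "bounce_rate": 0.60}
after = {"views": 1200, "bounce_rate": 0.55}
reward = seo_reward(before, after)
print(f"Reward for this action: {reward:.3f}")
```

A positive reward tells the model the action helped, and a negative one tells it the action hurt. The weights decide how much each metric matters and would need tuning for a real site.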

### Why is RL Useful for SEO?

RL is useful because SEO is dynamic—search engine algorithms change frequently, and user behavior can shift over time. By using RL, you create a system that constantly adapts and improves based on real-time data. This makes SEO strategies more efficient and reduces the guesswork involved. It helps websites stay competitive in search engine rankings without the need for constant manual intervention.



In [76]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Import necessary libraries
import requests  # Used for making HTTP requests to get content from URLs
from bs4 import BeautifulSoup  # BeautifulSoup is a library for scraping and parsing content from websites
import pandas as pd  # Pandas is used for working with datasets (loading, organizing, and analyzing data)
import re  # 're' is the regular expression library used for cleaning text (like removing digits)
import string  # 'string' library is used for removing punctuation from the text
from sklearn.feature_extraction.text import TfidfVectorizer  # This helps in extracting important keywords from text
import nltk  # 'nltk' (Natural Language Toolkit) is used for processing text
from nltk.corpus import stopwords  # Provides a list of common words like "the", "is", "in", etc., that we want to remove from the text

# Download stopwords if not already downloaded
nltk.download('stopwords')  # This step is needed to get the stopwords from the NLTK library

# Define file paths for the datasets
# These paths should point to the CSV files that contain your website's user flow data.
# Here, they are stored in a Google Drive folder, but you can replace these paths with the actual location of your files.
pagewise_user_flow_data_path = '/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/Pagewise User flow data.csv'
user_flow_data_path = '/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/User Flow Data.csv'
user_behavior_data_path = '/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/User Behavior Data.csv'

# URLs provided for scraping content
# This is a list of URLs (website pages) from which we want to extract text content to analyze keywords or phrases used on each page.
urls = [
    'https://thatware.co/',  # Main website homepage
    'https://thatware.co/services/',  # Services page
    'https://thatware.co/advanced-seo-services/',  # SEO services page
    'https://thatware.co/digital-marketing-services/',  # Digital marketing services
    'https://thatware.co/business-intelligence-services/',  # Business intelligence services
    'https://thatware.co/link-building-services/',  # Link-building services
    'https://thatware.co/branding-press-release-services/',  # Branding and press release services
    'https://thatware.co/conversion-rate-optimization/',  # Conversion rate optimization services
    'https://thatware.co/social-media-marketing/',  # Social media marketing services
    'https://thatware.co/content-proofreading-services/',  # Content proofreading services
    'https://thatware.co/website-design-services/',  # Website design services
    'https://thatware.co/web-development-services/',  # Web development services
    'https://thatware.co/app-development-services/',  # App development services
    'https://thatware.co/website-maintenance-services/',  # Website maintenance services
    'https://thatware.co/bug-testing-services/',  # Bug testing services
    'https://thatware.co/software-development-services/',  # Software development services
    'https://thatware.co/competitor-keyword-analysis/'  # Competitor keyword analysis services
]

# Step 1: Load the datasets provided by the user
# We are using the file paths to load the datasets into Pandas DataFrames. A DataFrame is a table-like structure where data is stored in rows and columns, like an Excel sheet.
# We will load the three different datasets that contain information about user behavior, user flow, and page-specific user flow.

# Load the datasets into DataFrames (structured data tables)
pagewise_user_flow_data = pd.read_csv(pagewise_user_flow_data_path)
user_flow_data = pd.read_csv(user_flow_data_path)
user_behavior_data = pd.read_csv(user_behavior_data_path)

# Display the first few rows of the datasets to ensure they're loaded correctly.
# This step allows us to verify that the data from the CSV files has been read correctly.
print("Pagewise User Flow Data: ")
print(pagewise_user_flow_data.head())  # Show the first 5 rows of the Pagewise User Flow dataset

print("\nUser Flow Data: ")
print(user_flow_data.head())  # Show the first 5 rows of the User Flow dataset

print("\nUser Behavior Data: ")
print(user_behavior_data.head())  # Show the first 5 rows of the User Behavior dataset

# Step 2: Scrape the content from URLs
# We need to extract the text content of each webpage for analysis.
# This function uses the 'requests' library to get the webpage and 'BeautifulSoup' to extract the raw text from it.

def scrape_content(url):
    try:
        # Send a request to the URL to get the webpage content (the timeout stops a slow server from hanging the loop)
        response = requests.get(url, timeout=10)
        # Parse the webpage's HTML with BeautifulSoup so that we can pull the text out of it
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract and return the text from the webpage. The separator=' ' ensures that all pieces of text are separated by a space.
        text = soup.get_text(separator=' ')
        return text
    except Exception as e:
        # If there's an error while scraping, print it and return an empty string (so the code doesn't break)
        print(f"Error scraping {url}: {e}")
        return ""  # Returning an empty string means no content was scraped from this URL.

# Step 3: Clean the scraped content
# Once we have the raw text from the webpage, we need to clean it so that it’s easier to analyze.
# Cleaning involves converting text to lowercase, removing digits, punctuation, and removing common words like 'the', 'is', etc.

def clean_text(text):
    """
    This function cleans the text by:
    1. Converting it to lowercase (to avoid treating 'Word' and 'word' differently).
    2. Removing digits (numbers do not add much value in keyword analysis).
    3. Removing punctuation (to clean up the text and focus only on meaningful words).
    4. Removing stopwords (common words like 'the', 'is', 'and' that don't add value to keyword extraction).
    """
    # Convert text to lowercase to make all words uniform (so 'SEO' and 'seo' are treated the same)
    text = text.lower()

    # Remove digits (we don't need numbers for this kind of analysis)
    text = re.sub(r'\d+', '', text)

    # Remove punctuation using the string library, which contains all common punctuation marks
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove common stopwords (like 'the', 'is', 'and') using the NLTK stopwords list
    stop_words = set(stopwords.words('english'))  # Load the English stopwords list from NLTK
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords from the text

    return text  # Return the cleaned text

# Step 4: Scrape and clean the content from each URL
# We now loop through each URL, scrape its content, and clean it using the functions we've defined.
# This is where we combine the scraping and cleaning steps for all URLs.

url_content = {}  # Create an empty dictionary to store the cleaned content for each URL

# Loop through each URL in the 'urls' list
for url in urls:
    content = scrape_content(url)  # Scrape the content from the URL
    cleaned_content = clean_text(content)  # Clean the scraped content
    url_content[url] = cleaned_content  # Store the cleaned content in the dictionary with the URL as the key

# Step 5: Convert the scraped and cleaned content into a DataFrame for further analysis
# We'll now store the scraped and cleaned text in a Pandas DataFrame, making it easier to view, analyze, or export the data.

# Create a DataFrame from the dictionary where each row contains a URL and its corresponding cleaned content
content_df = pd.DataFrame(list(url_content.items()), columns=['URL', 'Cleaned Content'])

# Display the cleaned content from the URLs to verify the result
print("\nScraped and Cleaned Content from URLs:")
print(content_df.head())  # Show the first few rows of the DataFrame with URLs and their cleaned content


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Pagewise User Flow Data: 
  Page path and screen class  Views  Active users  Views per active user  \
0                          /  36464         27161               1.342513   
1                 /services/   3893          2937               1.325502   
2          /360-seo-package/   3380          2745               1.231330   
3               /contact-us/   2805          2216               1.265794   
4    /advanced-seo-services/   1901          1507               1.261447   

   Average engagement time per active user  Event count  Key events  \
0                                29.505651       123776       26181   
1                                29.378618         8632         167   
2                                58.045537         8036         178   
3                                36.500000         6571         118   
4                                47.269409         4347         136   

   Total revenue  
0              0  
1              0  
2              0  
3             

In [None]:
# Import the necessary libraries for TF-IDF (Term Frequency - Inverse Document Frequency)
from sklearn.feature_extraction.text import TfidfVectorizer  # This helps in extracting important keywords
import pandas as pd  # Pandas is used for organizing and displaying data in a table format

# Step 1: Use TF-IDF to extract keywords from the cleaned content
# Here, we are creating an instance of TfidfVectorizer.
# The TfidfVectorizer will convert the cleaned text into a set of keywords with their importance (TF-IDF scores).
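# 'max_features=10' limits the vocabulary to the 10 terms that occur most often across ALL pages combined,
# so every URL is scored on the same 10 keywords (it is not a separate top 10 for each URL).
tfidf_vectorizer = TfidfVectorizer(max_features=10)  # Keep the 10 most frequent terms across the whole set of pages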

# Step 2: Fit and transform the cleaned content
# 'fit_transform' means we are applying the TF-IDF transformation on the cleaned text content from each webpage.
# It first learns the keywords from the content (fit) and then transforms that content into numerical TF-IDF scores (transform).
# 'content_df['Cleaned Content']' is the column in the DataFrame that contains the cleaned text from each URL.
tfidf_matrix = tfidf_vectorizer.fit_transform(content_df['Cleaned Content'])

# Step 3: Get the keywords (features) and their corresponding scores
# 'get_feature_names_out()' returns the keywords that TF-IDF kept in its vocabulary.
# The names come back in alphabetical order (matching the matrix columns), not ranked by relevance.
keywords = tfidf_vectorizer.get_feature_names_out()

# Step 4: Create a DataFrame to display the top keywords and their TF-IDF scores for each URL
# Now we want to see the TF-IDF scores in a more readable format.
# The 'tfidf_matrix' contains the actual scores, and we are converting it to a DataFrame so it looks like a table.
# 'tfidf_matrix.toarray()' converts the TF-IDF matrix into a regular array that we can display.
# We also use 'columns=keywords' to make sure the columns in the table represent the actual keywords.
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=keywords)

# Add the URLs to the DataFrame so that we can see which keywords correspond to which URL
# Here, we are adding the 'URL' column from the 'content_df' (which has URLs) to our new table (DataFrame) so that
# we can clearly see which keywords belong to which webpage.
tfidf_df['URL'] = content_df['URL']

# Display the DataFrame showing the keywords and their TF-IDF scores for each URL
# We print the DataFrame, and it will display a score for each of the same 10 keywords on every URL.
# The TF-IDF score represents how important a keyword is to that particular webpage. Higher scores mean the word is more relevant.
print("\nTF-IDF Keywords for Each URL:")
print(tfidf_df[['URL'] + list(keywords)])  # Show the table with the URLs and the keyword scores for each page

# Step 5: Save the result to a CSV file (optional)
# This step is optional. If we want to save the keywords and their scores for later use or share them, we can save them as a CSV file.
# The file will be saved as 'keyword_extraction_tfidf.csv' in the current working directory.
tfidf_df.to_csv('keyword_extraction_tfidf.csv', index=False)  # Save the table to a CSV file, without including row numbers



TF-IDF Keywords for Each URL:
                                                  URL  advanced  business  \
0                                https://thatware.co/  0.179875  0.050365   
1                       https://thatware.co/services/  0.050775  0.030465   
2          https://thatware.co/advanced-seo-services/  0.384554  0.160976   
3     https://thatware.co/digital-marketing-services/  0.116932  0.168382   
4   https://thatware.co/business-intelligence-serv...  0.096151  0.305937   
5         https://thatware.co/link-building-services/  0.264553  0.062657   
6   https://thatware.co/branding-press-release-ser...  0.092411  0.100812   
7   https://thatware.co/conversion-rate-optimization/  0.103880  0.113324   
8         https://thatware.co/social-media-marketing/  0.094523  0.120301   
9   https://thatware.co/content-proofreading-servi...  0.039895  0.053193   
10       https://thatware.co/website-design-services/  0.047931  0.260195   
11      https://thatware.co/web-development-s

In [None]:
# Import necessary libraries for calculating cosine similarity between URLs
# Cosine similarity helps us to find how similar two pieces of text are.
# Here, we'll use it to compare the content of different URLs.
from sklearn.metrics.pairwise import cosine_similarity  # For calculating cosine similarity
import pandas as pd  # For creating and managing data tables (DataFrames)

# Step 1: Calculate cosine similarity between the cleaned content of all URLs
# We will use the TF-IDF matrix that we created in the previous step.
# The TF-IDF matrix contains numerical scores representing how important each keyword is for each webpage.
# Cosine similarity will calculate how similar the content of one webpage is to another by comparing these scores.
cosine_sim_matrix = cosine_similarity(tfidf_matrix)

# Step 2: Create a DataFrame to store the cosine similarity scores
# A DataFrame is like a table. We want to create a table that shows the similarity scores between each URL and every other URL.
# We'll use the URLs as both the rows and columns so that we can easily see which URLs are similar to each other.
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=content_df['URL'], columns=content_df['URL'])

# Step 3: Display the similarity matrix (optional)
# This matrix will show how similar each page is to every other page.
# The value will be between 0 and 1. A value closer to 1 means the pages are very similar.
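# For example, a score of 0.95 means two pages weight the 10 tracked keywords in almost the same proportions.
# It is the cosine of the angle between the two TF-IDF vectors, not a literal "95% identical text" measure.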
print("\nCosine Similarity Matrix between URLs:")
print(cosine_sim_df)

# Step 4: Identify backlink opportunities
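# Now we want to recommend link opportunities based on the similarity of the content.
# Strictly speaking, a backlink is a link from another website to yours. All the URLs here belong to the same site,
# so these suggestions are internal links, which also help with SEO (Search Engine Optimization).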
# The idea here is: if two pages are similar, it's a good idea for them to link to each other.
backlink_recommendations = {}  # We'll store the backlink recommendations in this dictionary

# Iterate over each row (each URL) in the cosine similarity matrix
# For each URL, we'll find the top 3 most similar URLs. These will be our backlink suggestions.
for url in cosine_sim_df.index:
    # Drop the current URL first (a page can't recommend itself), so a tie in scores can never leave it in the list.
    # Then sort the remaining URLs by similarity score, in descending order (from most similar to least similar).
    similar_urls = cosine_sim_df[url].drop(url).sort_values(ascending=False).index[:3]  # Take the top 3 similar URLs

    # Store the top 3 recommendations for each URL
    backlink_recommendations[url] = similar_urls.tolist()  # Convert to a list and add to the dictionary

# Step 5: Display the backlink recommendations for each URL
# Now we'll print out the backlink recommendations.
# For each URL, we will show the top 3 other URLs it should consider linking to (these are the most similar pages).
print("\nBacklink Recommendations for Each URL:")
for url, recommendations in backlink_recommendations.items():
    print(f"For {url}, consider linking to: {recommendations}")

# Step 6: Save the backlink recommendations to a CSV file (optional)
# If you want to keep these recommendations for future use or share them, you can save them in a CSV file.
# This step is optional, but it's useful if you want to keep a record of the recommendations.
backlink_df = pd.DataFrame(list(backlink_recommendations.items()), columns=['URL', 'Backlink Recommendations'])
backlink_df.to_csv('backlink_recommendations.csv', index=False)  # Save as a CSV file



Cosine Similarity Matrix between URLs:
URL                                                 https://thatware.co/  \
URL                                                                        
https://thatware.co/                                            1.000000   
https://thatware.co/services/                                   0.973669   
https://thatware.co/advanced-seo-services/                      0.950085   
https://thatware.co/digital-marketing-services/                 0.693657   
https://thatware.co/business-intelligence-servi...              0.941741   
https://thatware.co/link-building-services/                     0.964001   
https://thatware.co/branding-press-release-serv...              0.948926   
https://thatware.co/conversion-rate-optimization/               0.968990   
https://thatware.co/social-media-marketing/                     0.946683   
https://thatware.co/content-proofreading-services/              0.836870   
https://thatware.co/website-design-services/    

### What is this Output?

This output shows two key pieces of information:
1. **Cosine Similarity Matrix between URLs**: This part shows how similar different pages on your website are to each other in terms of content.
2. **Backlink Recommendations for Each URL**: Based on the similarity between pages, it provides recommendations for which pages should be linked to each other. This can help improve internal linking for SEO purposes.

Let’s go through each part in detail, step by step.

---

### 1. **Cosine Similarity Matrix between URLs**
The Cosine Similarity Matrix compares the content of different URLs (webpages) on your website to determine how similar they are. Here, "content" means each page's TF-IDF scores for the 10 keywords extracted earlier. The similarity is measured using a number between 0 and 1:
- **1**: Perfect similarity (the pages weight those keywords in exactly the same proportions).
- **0**: No similarity (the pages share none of those keywords).

The matrix shows the similarity between every combination of URLs on your website.
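To make the score less abstract, here is a small hand-written version of the calculation: the dot product of two keyword-score vectors divided by the product of their lengths. The three vectors are made-up numbers rather than values from the notebook, and `cosine_similarity_manual` is a hypothetical helper, not the scikit-learn function used in the code.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a page with no tracked keywords is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical scores for three keywords on three pages
page_a = [0.5, 0.5, 0.0]
page_b = [1.0, 1.0, 0.0]   # same keyword mix as page_a, just larger numbers
page_c = [0.0, 0.0, 1.0]   # uses only the keyword the other two pages ignore

sim_ab = cosine_similarity_manual(page_a, page_b)
sim_ac = cosine_similarity_manual(page_a, page_c)
print("page_a vs page_b:", round(sim_ab, 6))
print("page_a vs page_c:", round(sim_ac, 6))
```

Pages `page_a` and `page_b` come out at approximately 1 because only the proportions matter, not the size of the scores, while `page_a` and `page_c` share no keywords and score 0.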

#### Example from the matrix:
```
URL                                                https://thatware.co/
https://thatware.co/                                             1.000000
https://thatware.co/services/                                    0.973669
```
- **https://thatware.co/** is compared with itself (which is why the value is **1.0**, meaning perfect similarity).
- **https://thatware.co/services/** has a similarity score of **0.973669** with the homepage, meaning the content of the `/services/` page is quite similar to the homepage.

Each row and column represent different pages on your website, and the numbers represent how closely related the content of the pages is.

#### What Does This Mean?
- The Cosine Similarity helps you understand which pages have content that overlaps or is closely related.
- Pages with high similarity might cover similar topics or services, so you can cross-link them to guide users and improve SEO.
  
For example, if two pages have a high similarity, like:
```
https://thatware.co/  ->  https://thatware.co/services/  : 0.973669
```
This suggests that these two pages might benefit from internal linking to help users navigate between similar topics and improve SEO.

---

### 2. **Backlink Recommendations for Each URL**
This section provides **backlink recommendations** based on the similarity scores from the Cosine Similarity Matrix. A "backlink" in this context means creating internal links between pages on your website. These links are beneficial for SEO, as they:
- Help users navigate between related pages.
- Improve the flow of "link juice" (SEO value) within your website, making it more likely for important pages to rank higher on search engines.

Each recommendation tells you which URLs (pages) to link to from a specific page on your website.

#### Example from the recommendations:
```
For https://thatware.co/, consider linking to:
- https://thatware.co/competitor-keyword-analysis/
- https://thatware.co/services/
- https://thatware.co/conversion-rate-optimization/
```
This means that for the homepage (`https://thatware.co/`), you should consider adding links to the following pages:
- **Competitor Keyword Analysis** (`https://thatware.co/competitor-keyword-analysis/`)
- **Services** (`https://thatware.co/services/`)
- **Conversion Rate Optimization** (`https://thatware.co/conversion-rate-optimization/`)

#### What Does This Mean?
- By following these recommendations, you can strengthen the internal linking structure of your website.
- For example, if you have a page about **Conversion Rate Optimization** and it’s highly related to your homepage, you should link to it from your homepage. This will help users and search engines understand that these pages are closely related.
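The selection step behind these recommendations can be sketched without any libraries. This is a simplified illustration with made-up page paths and scores; `recommend_links` is a hypothetical helper. It removes the page itself before ranking, so the result does not depend on the page's own score of 1.0 sorting first.

```python
def recommend_links(similarity, top_n=3):
    """For each page, return the top_n most similar other pages."""
    recommendations = {}
    for page, scores in similarity.items():
        # Leave out the page itself, because a page should not recommend itself
        others = {other: score for other, score in scores.items() if other != page}
        # Rank the remaining pages from most similar to least similar
        ranked = sorted(others, key=others.get, reverse=True)
        recommendations[page] = ranked[:top_n]
    return recommendations


# Hypothetical similarity scores between four pages
similarity = {
    "/": {"/": 1.0, "/services/": 0.97, "/pricing/": 0.40, "/contact/": 0.10},
    "/services/": {"/": 0.97, "/services/": 1.0, "/pricing/": 0.65, "/contact/": 0.20},
    "/pricing/": {"/": 0.40, "/services/": 0.65, "/pricing/": 1.0, "/contact/": 0.30},
    "/contact/": {"/": 0.10, "/services/": 0.20, "/pricing/": 0.30, "/contact/": 1.0},
}

recommendations = recommend_links(similarity, top_n=2)
for page, links in recommendations.items():
    print(f"For {page}, consider linking to: {links}")
```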

### Summary of Steps You Should Take:
1. **Understand which pages are closely related** using the **Cosine Similarity Matrix**. Look for high similarity scores (closer to 1).
2. **Follow the backlink recommendations** by creating internal links between pages that are related. This improves SEO by connecting similar content and helping search engines understand the relationship between different pages on your website.

---

### Explanation of Each Key Concept in Simple Terms:
- **Cosine Similarity Matrix**: This is a table that shows how similar the content of different pages on your website is. The more similar two pages are, the higher the number (closer to 1) you’ll see. You can use this information to link similar pages together.
  
- **Backlink Recommendations**: These are suggestions for which pages you should link to from each page. Linking related pages together helps users find relevant information more easily and boosts your site's SEO.

---

### What to Do Next:
- **Add internal links** based on the recommendations. For each page, check which other pages are similar and add links between them.
- **Check the content** of pages with high similarity. If two pages are too similar, you might want to adjust the content so each page focuses on a unique topic.
  
---

### Why This is Important:
- **Internal linking** helps search engines understand the structure of your website, which can lead to higher rankings in search results.
- **Improved user navigation**: When users can easily find related content, they spend more time on your site, which is also good for SEO.
  


### **Part 1: Data Loading and Keyword Extraction**
**Name: Data Collection and Keyword Extraction**

**What it does:**
- This part of the code handles **loading website data** and **fetching content** from web pages.
- It uses libraries like **pandas** to load data files and **requests** with **BeautifulSoup** to grab the text from the URLs (web pages).
- The code then **cleans** the text from the web pages by removing unnecessary things like numbers, punctuation, and common words that don't add much meaning (called **stopwords**).
- After cleaning, the code uses a method called **TF-IDF** (Term Frequency-Inverse Document Frequency), which finds the **most important keywords** from the cleaned text of each web page.
- These keywords are extracted and displayed in a table, giving a list of the most relevant words for each web page.
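
To make the cleaning step concrete, here is a small standalone sketch of the same idea applied to one sentence. The stopword set is a tiny illustrative subset (the notebook uses NLTK's English list plus custom words), and the sample sentence is made up.

```python
import re

# A tiny illustrative stopword set, not the full list used in the notebook.
STOP_WORDS = {"the", "is", "in", "and", "for", "our", "we"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase the text, then strip digits, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    words = text.split()                 # split on any run of whitespace
    return " ".join(word for word in words if word not in stop_words)

sample = "In 2023, we offer the BEST SEO services for e-commerce!"
cleaned = clean_text(sample)
print(cleaned)  # offer best seo services ecommerce
```

Note one side effect of removing punctuation this way: hyphenated words such as "e-commerce" are joined into a single token.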

**Purpose:**
The first part of the code **gathers data** and **extracts keywords** that will be used later to make decisions about SEO.

---


In [77]:
# Importing necessary libraries that will be used in the code.
# 'pandas' is used to load and handle datasets (tables of data).
import pandas as pd

# 'requests' is used to fetch the content of a web page from the internet.
import requests

# 'BeautifulSoup' is used to parse (read) the HTML code of web pages.
from bs4 import BeautifulSoup

# 're' is a library that helps clean text by removing unwanted parts like numbers or special characters.
import re

# 'TfidfVectorizer' helps in identifying important words (keywords) from a webpage by analyzing how frequently they appear.
from sklearn.feature_extraction.text import TfidfVectorizer

# 'stopwords' are common words like 'the', 'is', 'in', etc., which we don't need for our analysis.
# 'nltk' is a library that helps in processing and analyzing human language.
from nltk.corpus import stopwords
import nltk

# Downloading the list of stopwords if it's not already available. (Only needs to be done once.)
nltk.download('stopwords')

# Load the list of stopwords in English (words that are too common and don't add much value, like "the", "is", "in").
stop_words = set(stopwords.words('english'))

# Adding extra words to remove from our analysis that are specific to our context and not meaningful for our purpose.
# For example, words like 'also', 'use', 'thatware', 'website' don't help us understand the unique content of a webpage.
custom_stopwords = ['also', 'use', 'are', 'us', 'based', 'building', 'software', 'application', 'thatware', 'web', 'website']

# We combine the standard stopwords with our custom list of stopwords.
stop_words.update(custom_stopwords)

# 1. Loading the datasets (data files) that contain information about user behavior on different web pages.
# These CSV files (tables) store information about how users interact with the website.
# Replace the file paths below with the actual paths where your CSV files are saved.
# Note: these CSVs are loaded here for inspection only; the keyword extraction below does not use them.
pagewise_user_flow_data = pd.read_csv('/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/Pagewise User flow data.csv')
user_flow_data = pd.read_csv('/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/User Flow Data.csv')
user_behavior_data = pd.read_csv('/content/drive/MyDrive/Datasets For Reinforcement Learning for SEO Model/User Behavior Data.csv')

# Displaying (printing) the first few rows of each dataset to make sure the data has been loaded correctly.
# This helps to verify that the files were loaded without errors.
print("Pagewise User Flow Data:")  # Shows user activity on each page.
print(pagewise_user_flow_data.head())

print("\nUser Flow Data:")  # Shows user movement between pages.
print(user_flow_data.head())

print("\nUser Behavior Data:")  # Shows how users behave on the website.
print(user_behavior_data.head())

# 2. Fetching the content of various web pages (the text content that users see).
# Here is a list of the URLs (web addresses) we want to get content from.
# These URLs are provided by you.
urls = [
    'https://thatware.co/',  # Homepage of the website
    'https://thatware.co/services/',  # Services page
    'https://thatware.co/advanced-seo-services/',  # Advanced SEO services page
    'https://thatware.co/digital-marketing-services/',  # Digital marketing services page
    'https://thatware.co/business-intelligence-services/',  # Business intelligence services page
    'https://thatware.co/link-building-services/',  # Link building services page
    'https://thatware.co/branding-press-release-services/',  # Branding and press release services page
    'https://thatware.co/conversion-rate-optimization/',  # Conversion rate optimization page
    'https://thatware.co/social-media-marketing/',  # Social media marketing page
    'https://thatware.co/content-proofreading-services/',  # Content proofreading services page
    'https://thatware.co/website-design-services/',  # Website design services page
    'https://thatware.co/web-development-services/',  # Web development services page
    'https://thatware.co/app-development-services/',  # App development services page
    'https://thatware.co/website-maintenance-services/',  # Website maintenance services page
    'https://thatware.co/bug-testing-services/',  # Bug testing services page
    'https://thatware.co/software-development-services/',  # Software development services page
    'https://thatware.co/competitor-keyword-analysis/'  # Competitor keyword analysis page
]

# Create a list to store the content (text) of each webpage we fetch.
webpage_contents = []

# Loop through each URL in the list, fetch its content, and extract all the text.
# We use 'BeautifulSoup' to parse the HTML and extract the text.
for url in urls:
    response = requests.get(url, timeout=10)  # Fetch the page; the timeout stops a slow server from hanging the loop.
    soup = BeautifulSoup(response.text, 'html.parser')  # 'BeautifulSoup' reads and parses the HTML.
    text = soup.get_text(separator=' ')  # Extract the text; the separator keeps words from adjacent tags from running together.
    webpage_contents.append(text)  # Add the raw text of the webpage to our list.

# 3. Preprocessing: Cleaning the text data.
# This step involves converting the text into a cleaner format that can be analyzed. Specifically, we:
# - Convert all text to lowercase,
# - Remove digits (numbers),
# - Remove punctuation (like !, ?, etc.),
# - Remove stopwords (like 'the', 'is', etc.) so we can focus on meaningful keywords.
def clean_text(text):
    text = text.lower()  # Convert everything to lowercase.
    text = re.sub(r'\d+', '', text)  # Remove numbers (like '2023' or '50').
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (like '!', '?', etc.).
    text = re.sub(r'\s+', ' ', text).strip()  # Remove any extra spaces.
    words = text.split()  # Split text into words.
    cleaned_words = [word for word in words if word not in stop_words]  # Remove stopwords (unimportant words).
    return ' '.join(cleaned_words)  # Return cleaned text.

# Apply the 'clean_text' function to all the webpage contents to clean the text.
cleaned_webpage_contents = [clean_text(content) for content in webpage_contents]

# 4. Preparing for keyword extraction using TF-IDF (Term Frequency - Inverse Document Frequency).
# TF-IDF scores a word higher when it is frequent on one page and lower when it appears on many pages.
# Note: 'max_features=50' keeps the 50 most frequent terms across ALL pages combined, not 50 per page.

# Initialize the TF-IDF vectorizer with a shared vocabulary limited to 50 terms.
vectorizer = TfidfVectorizer(max_features=50)

# Apply the vectorizer to the cleaned webpage contents. This will create a matrix of keyword scores.
X = vectorizer.fit_transform(cleaned_webpage_contents)

# Convert the matrix of keywords and their importance into a DataFrame (table) for easier reading.
keywords_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# 5. Display the extracted keywords for each webpage in a human-readable format.
# For each page we print the vocabulary terms that actually occur on it (score above 0), strongest first, without the scores.
for i, url in enumerate(urls):
    print(f"\nTop Keywords for: {url}")  # Show which webpage we are referring to.
    page_scores = keywords_df.iloc[i]  # TF-IDF scores of this page for every vocabulary term.
    top_keywords = page_scores[page_scores > 0].sort_values(ascending=False).index.tolist()  # Drop absent terms, then sort.
    print(", ".join(top_keywords))  # Print the keywords as a readable list.



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Pagewise User Flow Data:
  Page path and screen class  Views  Active users  Views per active user  \
0                          /  36464         27161               1.342513   
1                 /services/   3893          2937               1.325502   
2          /360-seo-package/   3380          2745               1.231330   
3               /contact-us/   2805          2216               1.265794   
4    /advanced-seo-services/   1901          1507               1.261447   

   Average engagement time per active user  Event count  Key events  \
0                                29.505651       123776       26181   
1                                29.378618         8632         167   
2                                58.045537         8036         178   
3                                36.500000         6571         118   
4                                47.269409         4347         136   

   Total revenue  
0              0  
1              0  
2              0  
3              

### What is this output?

The output shows different pieces of data related to **user behavior** and **page performance** on your website, alongside **keyword extraction** from your web pages. This information is used to understand how users are interacting with your website and to analyze which keywords are most important for SEO (Search Engine Optimization). Let’s walk through each section of the output.

### 1. **Pagewise User Flow Data**
This section provides information about user interaction with specific pages on your website. Here's what the columns mean:

- **Page path and screen class**: The specific page on your website (e.g., `/`, `/services/`, etc.).
- **Views**: The total number of page views, i.e., how many times a page was viewed.
- **Active users**: The number of unique users who actively interacted with the page.
- **Views per active user**: Views divided by active users, i.e. how many times the average active user viewed the page. A value above 1 means some users viewed the page more than once.
- **Average engagement time per active user**: The average time each active user spent engaged with the page. The values in this export appear to be in seconds.
- **Event count**: This indicates the total number of events (clicks, scrolls, interactions) that took place on the page.
- **Key events**: These are special or significant events (e.g., button clicks, form submissions) that you might want to track.
- **Total revenue**: This would represent any revenue generated from the page (e.g., through sales), but in this case, the value is 0, indicating no direct revenue tracked.

#### Example from the output:
```
Page path and screen class  Views  Active users  Views per active user  
0                          /  36464  27161  1.34
```
This means that your homepage (`/`) was viewed 36,464 times by 27,161 active users. Each user viewed the page about 1.34 times on average, indicating that some users viewed it more than once.
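
The "Views per active user" column can be reproduced from the other two columns. The short check below uses the first two rows of the output above; the variable names are just for illustration.

```python
# Values copied from the first two rows of the Pagewise User Flow Data output.
rows = [
    {"page": "/", "views": 36464, "active_users": 27161},
    {"page": "/services/", "views": 3893, "active_users": 2937},
]

for row in rows:
    # Views per active user is simply total views divided by active users.
    row["views_per_active_user"] = row["views"] / row["active_users"]
    print(f'{row["page"]}: {row["views_per_active_user"]:.6f}')
```

The results agree with the values reported in the export (1.342513 and 1.325502).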

---

### 2. **User Flow Data**
This section gives insights into how users arrived at your website (referred to as the "channel group") and their behavior. Here’s what each term means:

- **First user primary channel group (Default Channel Group)**: This shows how users first discovered your website. For example, "Organic Search" means users found your site via search engines like Google.
- **Total users**: The total number of users coming from that particular channel.
- **New users**: The number of users who visited your website for the first time.
- **Returning users**: The number of users who have visited your site before and returned.
- **Average engagement time per active user**: The average time these users spent on your site.
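- **Engaged sessions per active user**: The average number of engaged sessions per active user. These column names match Google Analytics 4, where a session counts as engaged if it lasts longer than 10 seconds, includes a key event, or has at least two page views.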
- **Event count**: The total number of interactions or actions (e.g., clicks, form submissions) made by users.
- **Key events**: The number of important events tracked (e.g., purchases, downloads).
- **User key event rate**: The share of users who triggered at least one key event.

#### Example from the output:
```
First user primary channel group (Default Channel Group)  Total users  New users  
0  Organic Search  17155  16986
```
This means that 17,155 users came to your site through organic search, and 16,986 of those were visiting your site for the first time.

---

### 3. **User Behavior Data**
This part repeats much of the information in "Pagewise User Flow Data" but focuses on user behavior. It shows how users interact with specific pages by looking at:

- **Views**: The total number of views for each page.
- **Active users**: How many users interacted with the page.
- **Views per active user**: How often each user viewed the page.
- **Average engagement time per active user**: How long users stayed on each page.
- **Event count**: Total number of actions (clicks, interactions) performed on the page.
- **Key events**: Important actions like purchases, form submissions, etc.

#### Example from the output:
```
Page path and screen class  Views  Active users  Views per active user  
0                          /  40681  30791  1.32
```
This shows that your homepage had 40,681 views, with 30,791 active users. Each user, on average, viewed the page 1.32 times.

---

### 4. **Extracted Keywords for each webpage (Top 50)**
This section lists the keywords extracted from the content of each webpage. Note that `max_features=50` caps the vocabulary at 50 terms for the whole set of pages, not 50 per page, so every page is scored against the same 50 terms. Each term has a TF-IDF score per page: the score is higher when the term is frequent on that page and lower when the term appears on many of the other pages.

- **Top Keywords**: The terms with the highest scores on a page are the ones that best characterise its content. Using them naturally in the page's headings, text, and metadata helps search engines understand what the page is about; repeating them excessively does not.
  
Each webpage has its own ranking of these terms, and the scores show how prominent each term is in that page's content.

#### Example from the output:
```
Extracted Keywords for /services/:
  seo: 0.801174
  services: 0.557779
  social: 0.030424
```
This means that on the `/services/` page, the highest-scoring keyword is "seo" with a TF-IDF score of about 0.801, followed by "services" (about 0.558). These are the terms to focus on when optimizing this page.

---

### What should you do with this data?

As a website owner or marketer, here’s what you can do with the information:

1. **Optimize Content with Keywords**:
   - For each page, look at the **extracted keywords** and make sure those keywords are used naturally in the content, without forcing in extra repetitions. For example, if "SEO" is a top keyword for your `/services/` page, ensure that the page mentions SEO in headings, text, and metadata.
   - If a keyword that should be important is missing from the top keywords list, consider adding content around that keyword.

2. **Improve User Engagement**:
   - Check the **engagement time per user** for different pages. If users are spending less time on a page than expected, it may indicate that the content is not engaging enough. You could try adding more detailed information, images, or videos to keep users engaged.

3. **Boost Page Performance**:
   - Review the **event counts** and **key events** for each page. If certain pages have low event counts, you may want to improve the call-to-action (CTA) buttons or make the page easier to navigate, so users interact more with it (e.g., submit a form, click a button).

4. **Analyze Traffic Sources**:
   - Look at the **user flow data** to see where your traffic is coming from (e.g., Organic Search, Direct, Social Media). If organic search is performing well, continue focusing on SEO. If traffic from social media is low, you may want to boost your social media marketing efforts.
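
As one possible way to act on points 2 and 3, the sketch below flags pages whose engagement time is below the median and computes key events per active user. The rows are copied from the Pagewise User Flow Data preview above; the median cut-off is an assumed rule of thumb, not something the notebook prescribes.

```python
import pandas as pd

# Rows copied from the Pagewise User Flow Data preview shown earlier.
pagewise = pd.DataFrame({
    "Page path and screen class": ["/", "/services/", "/360-seo-package/",
                                   "/contact-us/", "/advanced-seo-services/"],
    "Active users": [27161, 2937, 2745, 2216, 1507],
    "Average engagement time per active user": [29.505651, 29.378618, 58.045537,
                                                36.500000, 47.269409],
    "Key events": [26181, 167, 178, 118, 136],
})

engagement = "Average engagement time per active user"
pagewise["Key events per active user"] = pagewise["Key events"] / pagewise["Active users"]

# Assumed rule of thumb: flag pages whose engagement time is below the median.
threshold = pagewise[engagement].median()
low_engagement = pagewise[pagewise[engagement] < threshold]
flagged_pages = low_engagement["Page path and screen class"].tolist()

print(low_engagement[["Page path and screen class", engagement, "Key events per active user"]])
```

With only the five preview rows, the homepage and `/services/` fall below the median; on the full dataset the flagged set would of course differ.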

---

### Conclusion

This output provides valuable insights into how your users interact with your website and which keywords are most important for optimizing your content. The main takeaways for you as a website owner are:
- **Ensure that important keywords** are used on each page in a relevant way.
- **Increase engagement** on low-performing pages by improving content or design.
- **Use traffic data** to focus on channels that bring the most users, and optimize the underperforming ones.


---
### **Part 2: Keyword and Backlink Recommendations**
**Name: Automatic Keyword and Backlink Suggestions**

**What it does:**
- This part uses the **keywords** that were extracted in the first part and generates **keyword recommendations** for each webpage. For each page it picks up to **3 single words, 5 two-word phrases, and 2 three-word phrases**, ranked by their TF-IDF scores.
- It also generates **backlink suggestions** by searching for relevant pages on the internet. It does this by performing a Google search using the first 100 characters of each webpage's cleaned content.
- The code displays **both keyword and backlink recommendations** for each web page, suggesting which keywords to target and which pages turned up by the search could be considered as link prospects.
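
The keyword selection described above can be sketched as follows. The `pick_by_length` helper and the ranking passed to it are hypothetical; unlike the notebook's function, which groups terms by length and then shuffles them, this version keeps the original rank order.

```python
def pick_by_length(ranked_terms, n_unigrams=3, n_bigrams=5, n_trigrams=2):
    """Keep the first n terms of each n-gram length, preserving rank order."""
    limits = {1: n_unigrams, 2: n_bigrams, 3: n_trigrams}
    taken = {1: 0, 2: 0, 3: 0}
    picked = []
    for term in ranked_terms:
        length = len(term.split())  # 1 = unigram, 2 = bigram, 3 = trigram
        if length in limits and taken[length] < limits[length]:
            picked.append(term)
            taken[length] += 1
    return picked

# Hypothetical ranking (best first) built from terms that appear in the notebook's output.
ranked = ["seo", "seo services", "services", "advanced seo", "ai", "digital marketing",
          "social media", "development", "audit implementation", "marketing"]
picked = pick_by_length(ranked)
print(picked)
```

Because this ranking contains no three-word phrases, only 3 + 5 = 8 terms are returned, which is consistent with the eight keywords per page seen in the output below.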

**Purpose:**
This part **automatically recommends keywords and backlinks** that can improve SEO for each webpage.

---


In [78]:
# Import necessary libraries for TF-IDF and Google search
from googlesearch import search  # To search for backlink suggestions
from sklearn.feature_extraction.text import TfidfVectorizer  # To extract keywords using TF-IDF
import pandas as pd  # To handle data in table format
import random  # For randomizing keywords and adding diversity

# 1. Function to recommend keywords based on TF-IDF scores
# This function takes a fixed number of single words, bigrams, and trigrams from the top of each page's ranking.
def recommend_keywords(keyword_matrix, feature_names, n_unigrams=3, n_bigrams=5, n_trigrams=2):
    """
    This function suggests relevant keywords for each webpage using a combination of unigrams (single words),
    bigrams (two-word phrases), and trigrams (three-word phrases).

    Args:
    - keyword_matrix: A matrix of keyword importance scores for each webpage.
    - feature_names: List of all keywords (unigrams, bigrams, trigrams) from the TF-IDF vectorizer.
    - n_unigrams: Number of single-word keywords to suggest.
    - n_bigrams: Number of two-word phrases to suggest.
    - n_trigrams: Number of three-word phrases to suggest.

    Returns:
    - keyword_recommendations: A list of keyword recommendations for each webpage.
    """

    # Initialize a list to store the keyword recommendations for each page.
    keyword_recommendations = []

    # Loop through each webpage (row) in the keyword matrix
    for i in range(keyword_matrix.shape[0]):
        # Get indices of the top keywords, sorted by importance
        top_keywords_idx = keyword_matrix[i].argsort()[::-1]  # Sort in descending order

        # Extract the corresponding keywords using their indices
        top_keywords = [feature_names[idx] for idx in top_keywords_idx]

        # Separate into unigrams, bigrams, and trigrams
        unigrams = [kw for kw in top_keywords if len(kw.split()) == 1][:n_unigrams]
        bigrams = [kw for kw in top_keywords if len(kw.split()) == 2][:n_bigrams]
        trigrams = [kw for kw in top_keywords if len(kw.split()) == 3][:n_trigrams]

        # Combine the selected unigrams, bigrams, and trigrams
        combined_keywords = unigrams + bigrams + trigrams

        # Shuffle the combined keywords for diversity (note: this discards the ranking order and varies between runs)
        random.shuffle(combined_keywords)

        # Add the recommended keywords for this page to the list
        keyword_recommendations.append(combined_keywords)

    # Return the full list of keyword recommendations for all pages
    return keyword_recommendations


# 2. Function to recommend backlinks for each webpage based on a Google search
def recommend_backlinks(page_content, num_results=5):
    """
    This function recommends backlinks by performing a Google search based on the content of the webpage.

    Args:
    - page_content: Cleaned text content of the webpage.
    - num_results: Number of backlinks to suggest (default is 5).

    Returns:
    - backlinks: A list of suggested backlinks.
    """
    query = f"{page_content[:100]} SEO backlinks"  # Use first 100 characters of the content as the query

    # Initialize a list to store the backlinks
    backlinks = []

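    # Perform a Google search and get the top 'num_results' results.
    # Note: the 'num', 'stop', and 'pause' arguments match the 'google' package; the separate 'googlesearch-python' package uses a different signature.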
    for result in search(query, num=num_results, stop=num_results, pause=2):
        backlinks.append(result)  # Add the result to the list

    return backlinks


# 3. Setting up the TF-IDF Vectorizer for keyword extraction
# The vectorizer captures unigrams, bigrams, and trigrams (single words, two-word, and three-word phrases)
vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 3))  # Extract unigrams, bigrams, and trigrams

# Apply the vectorizer to the cleaned webpage contents to create a matrix of keyword scores
X = vectorizer.fit_transform(cleaned_webpage_contents)

# Extract the feature names (the unigrams, bigrams, and trigrams) from the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get keyword recommendations using the function
keyword_recommendations = recommend_keywords(X.toarray(), feature_names, n_unigrams=3, n_bigrams=5, n_trigrams=2)

# Recommend backlinks for each webpage based on its content
# Loop through the cleaned webpage content and recommend 3 backlinks per page
backlink_recommendations = [recommend_backlinks(content, num_results=3) for content in cleaned_webpage_contents]

# Ensure the number of recommendations matches the number of URLs
# If not, this will avoid the 'IndexError' by ensuring the lists are of equal length.
min_length = min(len(keyword_recommendations), len(urls), len(backlink_recommendations))

# Display the keyword and backlink recommendations for each webpage
for i in range(min_length):  # Loop only through the number of pages we have data for
    url = urls[i]

    # Print the URL of the webpage we're providing recommendations for
    print(f"\nRecommendations for: {url}")

    # Print the top keywords/phrases recommended for the current webpage
    print(f"Top Keywords/Phrases: {keyword_recommendations[i]}")

    # Print the suggested backlinks for the current webpage
    print(f"Suggested Backlinks: {backlink_recommendations[i]}")




Recommendations for: https://thatware.co/
Top Keywords/Phrases: ['advanced seo', 'social media', 'audit implementation', 'digital marketing', 'seo', 'ai', 'seo services', 'services']
Suggested Backlinks: ['https://thatware.co/', 'https://thatware.co/services/', 'https://thatware.co/advanced-seo-services/']

Recommendations for: https://thatware.co/services/
Top Keywords/Phrases: ['social media', 'seo', 'development', 'advanced seo', 'services', 'audit implementation', 'digital marketing', 'seo services']
Suggested Backlinks: ['https://www.embarque.io/post/top-rated-digital-marketing-seo-agencies', 'https://www.semrush.com/agencies/list/seo/united-states/', 'https://www.designrush.com/agency/search-engine-optimization']

Recommendations for: https://thatware.co/advanced-seo-services/
Top Keywords/Phrases: ['digital marketing', 'implementation', 'social media', 'audit implementation', 'advanced seo', 'seo services', 'pages', 'seo']
Suggested Backlinks: ['https://thatware.co/advanced-seo

### What This Output Represents:
This output is the result of a program that provides **keyword recommendations** and **backlink suggestions** for different pages on your website (`thatware.co`). Each **URL** is analyzed to extract the most important keywords and relevant backlinks that could help with SEO (Search Engine Optimization) and digital marketing strategies.

---

### Output Breakdown:

#### Example 1: `Recommendations for: https://thatware.co/`

1. **Top Keywords/Phrases:**
   ```
   ['seo', 'ai', 'services', 'audit implementation', 'seo services', 'advanced seo', 'social media', 'digital marketing']
   ```
   - **What does this mean?**
     - These are the most important **keywords and phrases** that the system has identified for this specific page (https://thatware.co/).
     - **Keywords like "seo," "ai," "services,"** etc., are words frequently used on this page or are relevant for SEO in this context.
     - **Why are they important?**
       - These words can help improve the page's search ranking if used effectively. Search engines like Google may consider these words to decide how relevant this page is for people searching online.

2. **Suggested Backlinks:**
   ```
   ['https://thatware.co/', 'https://thatware.co/services/', 'https://thatware.co/advanced-seo-services/']
   ```
   - **What does this mean?**
     - These are the URLs returned by the Google search for this page. Here all three belong to `thatware.co` itself, so they are **internal pages**, not external sites.
     - **Why are they important?**
       - **Backlinks** are links from other websites to yours, and links from trusted sites can improve a page's ranking. Results on your own domain cannot supply backlinks; at most they are candidates for internal linking.

---

#### Example 2: `Recommendations for: https://thatware.co/services/`

1. **Top Keywords/Phrases:**
   ```
   ['audit implementation', 'seo services', 'digital marketing', 'services', 'development', 'social media', 'seo', 'advanced seo']
   ```
   - **What does this mean?**
     - These are the most relevant **keywords and phrases** identified for the **Services** page.
     - Words like "seo services," "digital marketing," "social media," etc., are seen as important words for SEO on this page.

2. **Suggested Backlinks:**
   ```
   ['https://www.embarque.io/post/top-rated-digital-marketing-seo-agencies',
   'https://www.designrush.com/agency/search-engine-optimization',
   'https://www.designrush.com/agency/search-engine-optimization/trends/seo-for-financial-services']
   ```
   - **What does this mean?**
     - These are the **top search results** for the page's query. They can serve as backlink prospects (sites you could ask to link to you) or as resources to link out to.
     - The code does not measure authority or relevance; it simply takes the first results Google returns, so review each site before pursuing a link.

---

#### Example 3: `Recommendations for: https://thatware.co/advanced-seo-services/`

1. **Top Keywords/Phrases:**
   ```
   ['advanced seo', 'social media', 'implementation', 'seo', 'pages', 'seo services', 'digital marketing', 'audit implementation']
   ```
   - **What does this mean?**
     - These are the most significant **keywords** for the **Advanced SEO Services** page.
     - Words like "advanced seo," "social media," and "seo services" are crucial terms on this page for better search visibility.

2. **Suggested Backlinks:**
   ```
   ['https://thatware.co/advanced-seo-services/', 'https://www.seogenics.co/advanced-seo-services/', 'https://sandbox.advandemo.com/services/']
   ```
   - **What does this mean?**
     - These are the search results returned for this page. The first is the page itself; the other two are external sites that would need to be reviewed before treating them as link prospects.
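
Since the search results mix the site's own pages with other sites, it can help to separate the two before reviewing them. The sketch below does this with the standard library; `host_of` and `split_by_site` are hypothetical helper names, and the URLs are the ones from Example 3.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def split_by_site(page_url, suggested_urls):
    """Separate suggestions on the same host as page_url from those on other hosts."""
    own_host = host_of(page_url)
    internal, external = [], []
    for url in suggested_urls:
        (internal if host_of(url) == own_host else external).append(url)
    return internal, external

page = "https://thatware.co/advanced-seo-services/"
suggestions = [
    "https://thatware.co/advanced-seo-services/",
    "https://www.seogenics.co/advanced-seo-services/",
    "https://sandbox.advandemo.com/services/",
]
internal, external = split_by_site(page, suggestions)
print("Internal:", internal)
print("External:", external)
```

Only the external list contains possible backlink prospects; the internal list is useful for internal linking at most.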

---

### Explanation of Repeated Keywords:
- **Why do you see similar or repeated keywords across different pages?**
  - Many of your pages are related to **SEO, digital marketing, and services**, so they share **common terms** like "SEO," "services," "digital marketing," etc.
  - The program uses a method called **TF-IDF (Term Frequency-Inverse Document Frequency)**, which scores words by how frequently they appear on a page **and** how rare they are across all pages. However, the vectorizer is limited to a shared vocabulary of 50 terms chosen by overall frequency (`max_features=50`), so terms that are common across the whole site, such as those related to SEO, dominate every page's list.
  
- **What can be improved?**
  - To get more unique recommendations, you might want to **provide more diverse content** on each page. If the content on different pages is very similar, the keywords will also be similar. Raising `max_features` would also let more page-specific terms into the vocabulary.

---

### Common Terminology Breakdown:

1. **Keywords/Phrases:**
   - These are the most **important words** or phrases from the content of each page.
   - Search engines look at these keywords to understand what the page is about and to rank it for related searches.

2. **Backlinks:**
   - Backlinks are **links from other websites** pointing to your website. They are crucial for SEO because search engines like Google view them as a vote of confidence for your site.
   - **Why are they important?**
     - More high-quality backlinks can significantly improve your website’s ranking in search engines.

3. **TF-IDF (Term Frequency - Inverse Document Frequency):**
   - This is a method to measure how important a word is within a document relative to a group of documents.
   - Words like "SEO" might appear frequently, but **TF-IDF** helps balance this by giving higher weight to words that are unique or rare in other pages.
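
To see how the "inverse document frequency" part down-weights common words, here is a small worked calculation. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, `ln((1 + n) / (1 + df)) + 1`; the document frequencies passed in are hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as in scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_pages = 17  # number of URLs fetched in Part 1

idf_everywhere = smoothed_idf(n_pages, 17)  # a term found on every page
idf_single = smoothed_idf(n_pages, 1)       # a term found on only one page

print(f"term on all {n_pages} pages: idf = {idf_everywhere:.3f}")
print(f"term on a single page: idf = {idf_single:.3f}")
```

A term that occurs on every page gets the minimum weight of 1, while a term unique to one page is weighted roughly three times higher. The final TF-IDF score also multiplies in the term's frequency on the page and is then normalised per page, so a very frequent common word can still rank highly.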

---


---
### **Part 3: Reinforcement Learning for SEO Actions**
**Name: SEO Action Learning with Reinforcement Learning**

**What it does:**
- This part introduces **Reinforcement Learning (RL)** to help the system learn which **actions** (like using specific keywords or backlinks) lead to the best improvement in website traffic.
- The code defines **states** (low traffic, medium traffic, high traffic) and **actions** (using keywords and backlinks).
- A **Q-table** is created, which is like a **scoreboard** where the system keeps track of which actions perform best for each traffic state.
- Over time, the system learns by trying different actions, checking how much traffic changes, and updating its **knowledge** in the Q-table to improve future decisions.
- After running for several rounds (called **episodes**), the system becomes better at choosing the best SEO actions to boost traffic.

**Purpose:**
This part **trains an AI model** using reinforcement learning to **learn from experience** and decide which keywords or backlinks work best to improve traffic.
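
For readers new to Q-learning, the sketch below shows the standard tabular update rule and an exploration schedule using the same parameter values as the code that follows. It is an illustration under assumptions, not a copy of the notebook's training loop: the transition (state, action, reward) is made up, and epsilon is assumed to decay once per episode.

```python
import numpy as np

# 3 traffic states and 8 actions (5 keywords + 3 backlinks), as in the notebook.
q_table = np.zeros((3, 8))
learning_rate, discount_factor = 0.1, 0.9

def q_update(q, state, action, reward, next_state):
    """Apply one standard tabular Q-learning update in place."""
    td_target = reward + discount_factor * np.max(q[next_state])
    q[state, action] += learning_rate * (td_target - q[state, action])

# Hypothetical transition: in low traffic (state 0), action 2 raised traffic
# by 120 visitors and moved the site to medium traffic (state 1).
q_update(q_table, state=0, action=2, reward=120, next_state=1)
print(q_table[0, 2])  # 12.0

# Exploration schedule, assuming epsilon is decayed once per episode.
epsilon, epsilon_decay, min_epsilon = 1.0, 0.995, 0.1
episodes_to_floor = 0
while epsilon > min_epsilon:
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    episodes_to_floor += 1
print(episodes_to_floor)  # 460
```

Under these assumptions the exploration rate reaches its floor of 0.1 after 460 of the 2000 episodes, so most of the training would run with 10% random actions.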

---


In [79]:
import numpy as np  # This library is used for mathematical operations, especially for handling arrays like the Q-table.

# Step 1: Define a list of keywords and backlinks (these are the actions that the model will choose from).
# Keywords are specific terms related to SEO, and backlinks are candidate link URLs (here, pages on the site itself).
keywords = ['seo', 'services', 'ai', 'marketing', 'advanced']
backlinks = ['https://thatware.co/', 'https://thatware.co/services/', 'https://thatware.co/advanced-seo-services/']

# Step 2: Combine the keywords and backlinks into a single list of actions.
# These actions will be used by the reinforcement learning model to optimize traffic.
# Example of actions: 'use_seo', 'use_services', 'use_https://thatware.co/'
actions = ['use_' + keyword for keyword in keywords] + ['use_' + backlink for backlink in backlinks]

# Step 3: Define the traffic levels as states.
# The website traffic is categorized into three states: low, medium, and high.
states = ['low_traffic', 'medium_traffic', 'high_traffic']

# Step 4: Initialize the Q-table, which is used for storing the model's "learning."
# The Q-table has dimensions based on the number of states (traffic levels) and the number of actions (keywords and backlinks).
# Initially, the Q-table is filled with zeros.
q_table = np.zeros((len(states), len(actions)))

# Step 5: Set the learning parameters for the reinforcement learning algorithm.
# These are values that influence how the model learns.
learning_rate = 0.1  # This controls how much the model adjusts its predictions with each new experience.
discount_factor = 0.9  # This controls how much the model cares about future rewards (as opposed to immediate ones).
epsilon = 1.0  # This is for exploration: it decides how often the model should try new actions rather than stick to known actions.
epsilon_decay = 0.995  # This makes the model explore less over time (become more confident in its choices).
min_epsilon = 0.1  # This is the minimum exploration rate, so the model never stops exploring entirely.
episodes = 2000  # This defines how many "training rounds" the model will go through.

# Step 6: Function to define which traffic state (low, medium, or high) a certain traffic value belongs to.
# This is important because the actions depend on whether traffic is low, medium, or high.
def get_state(traffic):
    if traffic < 1000:
        return 0  # Low traffic (e.g., less than 1000 visitors).
    elif 1000 <= traffic < 5000:
        return 1  # Medium traffic (between 1000 and 5000 visitors).
    else:
        return 2  # High traffic (5000 or more visitors).

# Step 7: Function to calculate the reward.
# The reward measures how much the traffic has improved after taking an action.
# A positive reward means traffic increased, while a negative reward means traffic decreased.
def get_reward(traffic, previous_traffic):
    return traffic - previous_traffic  # The reward is simply the difference between the new and old traffic levels.

# Step 8: The main training loop for reinforcement learning.
# The model will train for a set number of episodes, trying different actions to improve traffic.
for episode in range(episodes):
    # Start with a random traffic value from 500 to 2999 (randint excludes the upper bound); this simulates random initial traffic conditions.
    current_traffic = np.random.randint(500, 3000)
    previous_traffic = current_traffic  # Store the initial traffic as the 'previous' traffic for comparison later.

    # Perform 15 steps in each episode to allow the model to try different actions.
    for step in range(15):
        # Get the current traffic state (low, medium, or high) based on the current traffic value.
        state = get_state(current_traffic)

        # Step 9: Epsilon-greedy action selection strategy.
        # This is where the model decides whether to explore new actions or exploit known good actions.
        if np.random.rand() < epsilon:
            # Explore: randomly choose an action (this helps the model try new things).
            action = np.random.choice(len(actions))
        else:
            # Exploit: choose the action with the highest value from the Q-table for the current state.
            action = np.argmax(q_table[state])

        # Step 10: Simulate the selected action (for example, applying a keyword or a backlink).
        action_taken = actions[action]
        print(f"Action taken: {action_taken}")

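        # Step 11: Simulate how the traffic changes after applying the action.
        # Traffic changes by a random amount from -100 to +499 (the upper bound of randint is exclusive).
        # Note: in this simulation the change is drawn independently of the chosen action.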
        traffic_change = np.random.randint(-100, 500)
        current_traffic = max(0, current_traffic + traffic_change)  # Ensure the traffic never goes below zero.

        # Step 12: Calculate the reward based on how much the traffic changed.
        reward = get_reward(current_traffic, previous_traffic)

        # Step 13: Update the Q-table using the Q-learning formula.
        # This updates the value for the action taken, considering the reward and future predictions.
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_factor * np.max(q_table[get_state(current_traffic)]) - q_table[state, action])

        # Update 'previous_traffic' for the next step.
        previous_traffic = current_traffic

    # Step 14: Reduce the exploration rate (epsilon) gradually after each episode.
    # This makes the model explore less over time and rely more on learned actions.
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

# Step 15: Once training is completed, display the final Q-table.
# This table shows how the model has learned to take certain actions (keywords or backlinks) based on traffic levels.
print("Training completed. Final Q-table:")
print(q_table)


Streaming output truncated to the last 5000 lines.
Action taken: use_services
Action taken: use_services
Action taken: use_services
...

### What does this output represent?

The final output of the training cell is the **Q-table** produced by the **Reinforcement Learning (RL) model**. The table holds one learned value for every combination of traffic level and action (a specific keyword or backlink). Because the training loop simulates traffic changes rather than measuring a live site, the values describe that simulation. The sections below explain each part.

### What is the Q-table?

- The Q-table is where the model stores the values it has learned over time. These values help the model decide which actions (e.g., use a keyword or a backlink) are the most effective in improving the website’s performance based on the current traffic level (low, medium, or high traffic).
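
How quickly these values settle depends partly on how often the model tries a random action instead of its current best one, which the training cell controls through `epsilon` (start 1.0, decay 0.995 per episode, floor 0.1). As a rough sketch that reuses those same settings, the loop below replays the decay to see when exploration bottoms out. The names `schedule` and `first_floor_episode` are introduced here for illustration only.

```python
# Replay the epsilon schedule from the training cell (same constants).
epsilon = 1.0
epsilon_decay = 0.995
min_epsilon = 0.1
episodes = 2000

schedule = []  # schedule[i] is the epsilon in effect during episode i (0-indexed)
for episode in range(episodes):
    schedule.append(epsilon)
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

# First episode that runs at the minimum exploration rate.
first_floor_episode = next(i for i, e in enumerate(schedule) if e == min_epsilon)
print(f"Exploration floor reached at episode index {first_floor_episode}")
```

With these settings the floor is reached after roughly 460 episodes, so most of the 2000 training episodes run with a 10% chance of a random action at each step.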

### Structure of the Q-table:

The Q-table is a matrix (table) with **rows** representing different traffic states (low, medium, high traffic), and **columns** representing different actions (e.g., using specific keywords or backlinks).

In this case:
- The **rows** represent the traffic states:
  1. Row 1 (index 0): Low traffic
  2. Row 2 (index 1): Medium traffic
  3. Row 3 (index 2): High traffic
- The **columns** represent the actions the model can take (e.g., using a specific keyword or applying a backlink).

Each number in the Q-table represents the "value" of that action in a particular traffic state. A **higher value** means the model estimates a larger cumulative (discounted) traffic gain from taking that action in that state.
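
To make the idea of a "value" concrete, here is a minimal worked sketch of a single Q-learning update, using the same `learning_rate` and `discount_factor` as the training cell. The helper name `q_update` and the sample reward of 200 visitors are assumptions chosen for illustration, not values taken from the training run.

```python
learning_rate = 0.1    # same as the training cell
discount_factor = 0.9  # same as the training cell

def q_update(old_value, reward, best_next_value):
    """Return the updated Q-value for one (state, action) pair."""
    target = reward + discount_factor * best_next_value
    return old_value + learning_rate * (target - old_value)

# First visit: the cell starts at 0 and the next state's best value is also 0.
first = q_update(old_value=0.0, reward=200, best_next_value=0.0)

# Second visit with the same reward: the value keeps moving toward the target.
second = q_update(old_value=first, reward=200, best_next_value=0.0)

print(first, second)
```

Each update moves the stored value only 10% of the way toward the new target, which is why the table needs many episodes before its numbers stabilize.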

### Example:
Let’s break down the first row:
```
[2105.19054338 1152.65112375 1335.59058062 1716.1344446  1255.54186024 1627.53573836 1831.11986302 1684.4185963 ]
```
- This row corresponds to **low traffic** (the first state).
- Each number in this row represents the learned value of one action when traffic is low. The eight columns follow the order of the `actions` list: the five keywords (`use_seo`, `use_services`, `use_ai`, `use_marketing`, `use_advanced`) followed by the three backlinks.
- The higher the number, the more traffic gain the model expects from that action in low traffic conditions.

For example:
- **2105.19054338** is the value of taking action 1 (`use_seo`, the keyword "seo").
- **1152.65112375** is the value of taking action 2 (`use_services`, the keyword "services").

From this, we can say that for low traffic, **action 1 (using the "seo" keyword)** is the model's preferred choice, since it has the highest value in the row (2105.19054338).

### How do we interpret the output?

1. **First row (Low traffic)**: The highest value in this row is **2105.19054338**, in column 1, so for low traffic the model recommends action 1 (`use_seo`, the keyword "seo").

2. **Second row (Medium traffic)**: The highest value in this row is **2025.9051201**, in column 5, so for medium traffic the model recommends action 5 (`use_advanced`, the keyword "advanced").

3. **Third row (High traffic)**: The highest value in this row is **2014.57091612**, in column 4, so for high traffic the model recommends action 4 (`use_marketing`, the keyword "marketing").
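
Reading the best action off each row can also be done programmatically. The sketch below uses plain Python lists holding the Q-table values reported in this notebook and the same action order as the training cell. The helper name `best_action_for` is introduced here for illustration.

```python
keywords = ['seo', 'services', 'ai', 'marketing', 'advanced']
backlinks = ['https://thatware.co/', 'https://thatware.co/services/',
             'https://thatware.co/advanced-seo-services/']
actions = ['use_' + k for k in keywords] + ['use_' + b for b in backlinks]
states = ['low_traffic', 'medium_traffic', 'high_traffic']

q_table = [
    [2105.19054338, 1152.65112375, 1335.59058062, 1716.1344446,
     1255.54186024, 1627.53573836, 1831.11986302, 1684.4185963],
    [1969.80556877, 1987.47670425, 1970.41436733, 1987.60894048,
     2025.9051201, 2008.15674645, 1989.36437939, 1990.07760395],
    [1912.42135715, 1864.81457262, 1914.71601572, 2014.57091612,
     1875.39099813, 1883.47094599, 1880.99413789, 1926.02072669],
]

def best_action_for(row):
    """Return the action name whose Q-value is largest in this row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return actions[best_index]

best_by_state = {state: best_action_for(row) for state, row in zip(states, q_table)}
for state, action in best_by_state.items():
    print(f"{state}: {action}")
```

Mapping indices back to names this way avoids guessing which keyword or backlink a column number refers to.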

### What does this output tell you?
- **For low traffic**, the model favors the "seo" keyword (`use_seo`), which has the highest learned value in that state.
- **For medium traffic**, the model favors the "advanced" keyword (`use_advanced`), which can be read as a shift toward more advanced SEO.
- **For high traffic**, the model favors the "marketing" keyword (`use_marketing`), which can be read as a shift toward broader marketing efforts.

### What should you, as a website owner, do with this output?

1. **Low Traffic**: If your traffic is low, the model suggests focusing on **basic SEO** strategies, such as incorporating keywords like "seo" and applying effective backlinks. These actions will help boost your visibility and start driving more traffic to your website.

2. **Medium Traffic**: For medium traffic, the model recommends using **advanced SEO strategies**, such as targeting more specific or competitive keywords like "advanced SEO services" or improving your backlinking strategy.

3. **High Traffic**: For high traffic, it’s time to focus on **broad marketing and scaling strategies** to maintain and further boost your performance. This might involve bigger marketing campaigns or using high-quality backlinks to continue growing your audience.

### In Simple Terms:
- The model is "telling" you what actions are most effective depending on how many visitors your website is getting (traffic).
- Based on your current traffic, you can make the right changes to your site (using the keywords or backlinks the model recommends) to boost your traffic further.
- The **numbers** in the Q-table help the model know which actions are better. The higher the number, the better that action is for your site.

### Next steps:
1. **Monitor your traffic**: Based on whether you have low, medium, or high traffic, follow the model’s recommendations.
2. **Take action**: Use the recommended keywords or backlinks based on your current traffic state.
3. **Track performance**: After applying the model’s recommendations, check if your traffic is improving over time.



---
### **Part 4: Real-Time Recommendations Based on Traffic**
**Name: Real-Time SEO Action Recommendations**

**What it does:**
- This part uses the **trained Q-table** (from the previous part) to make **real-time recommendations** for which keywords or backlinks to use based on the current traffic level.
- It defines **three traffic levels** (low, medium, high) and suggests the **best SEO action** (whether to use a keyword or backlink) depending on the traffic.
- It simulates some real-time traffic values and provides **recommendations** for what actions to take to improve SEO at that moment.

**Purpose:**
This part **automatically recommends the best action** (keyword or backlink) to apply in real-time, based on how much traffic the website is getting.

---


import numpy as np  # Import the numpy library, which is useful for working with numbers, especially arrays (tables of data).

# Step 1: This is the "trained Q-table" from the previous step.
# It is like a memory of what actions (keywords or backlinks) are the best to take based on different traffic levels.
# Each row represents a traffic level:
# - The first row is for low traffic
# - The second row is for medium traffic
# - The third row is for high traffic
# The numbers in the table represent how good each action is for a particular traffic level.
q_table = np.array([
    [2105.19054338, 1152.65112375, 1335.59058062, 1716.1344446,  1255.54186024, 1627.53573836, 1831.11986302, 1684.4185963],
    [1969.80556877, 1987.47670425, 1970.41436733, 1987.60894048, 2025.9051201,  2008.15674645, 1989.36437939, 1990.07760395],
    [1912.42135715, 1864.81457262, 1914.71601572, 2014.57091612, 1875.39099813, 1883.47094599, 1880.99413789, 1926.02072669]
])

# Step 2: These are the keywords and backlinks that the model uses as possible actions to improve SEO.
# Keywords help make the content of a website more discoverable, and backlinks are links from other sites that can boost traffic.
# We list out the keywords and backlinks.
keywords = ['seo', 'services', 'ai', 'marketing', 'advanced']
backlinks = ['https://thatware.co/', 'https://thatware.co/services/', 'https://thatware.co/advanced-seo-services/']

# Step 3: We create a list of actions, combining both keywords and backlinks.
# Each action is labeled as 'use_' followed by the keyword or backlink. For example, 'use_seo' means the model is applying the 'seo' keyword.
# This means the model can either choose to use a keyword or a backlink as an action to improve traffic.
actions = ['use_' + keyword for keyword in keywords] + ['use_' + backlink for backlink in backlinks]

# Step 4: This function determines which traffic state the website is in based on the number of visitors (traffic).
# There are 3 possible states:
# - Low traffic: Fewer than 1000 visitors.
# - Medium traffic: From 1000 up to (but not including) 5000 visitors.
# - High traffic: 5000 visitors or more.
# The state is important because the best action might depend on whether traffic is low, medium, or high.
def get_state(traffic):
    if traffic < 1000:
        return 0  # If traffic is less than 1000, we say it's low traffic (state 0).
    elif 1000 <= traffic < 5000:
        return 1  # If traffic is at least 1000 but below 5000, it's medium traffic (state 1).
    else:
        return 2  # If traffic is 5000 or more, it's high traffic (state 2).

# Step 5: This function recommends the best action (a keyword or backlink) based on the current traffic.
# It looks up the Q-table to find the best action for the current traffic level (state).
def recommend_action(current_traffic):
    # Determine the current state (low, medium, or high) based on the traffic.
    state = get_state(current_traffic)

    # Look up the Q-table to find the action that has the highest score (best action) for the current state.
    # This uses np.argmax() to find the index of the best action.
    best_action_index = np.argmax(q_table[state])

    # Get the name of the best action (for example, 'use_seo' or 'use_https://thatware.co/').
    best_action = actions[best_action_index]

    # Return the recommended action (the best keyword or backlink to use).
    return best_action

# Step 6: Now, we simulate some real-time traffic data (for example: 300, 1500, and 6000 visitors).
# The purpose is to test how well the model recommends actions based on these traffic numbers.
current_traffic_data = [300, 1500, 6000]  # Example traffic values to simulate low, medium, and high traffic.

# Step 7: For each traffic level in current_traffic_data, we call the recommend_action function.
# It will tell us which keyword or backlink the model recommends to improve traffic.
for traffic in current_traffic_data:
    best_action = recommend_action(traffic)  # Get the best action for the given traffic level.
    print(f"For traffic level {traffic}, the recommended action is: {best_action}")  # Display the recommended action.


For traffic level 300, the recommended action is: use_seo
For traffic level 1500, the recommended action is: use_advanced
For traffic level 6000, the recommended action is: use_marketing


### **What Does the Output Mean?**

The cell above prints:
```
For traffic level 300, the recommended action is: use_seo
For traffic level 1500, the recommended action is: use_advanced
For traffic level 6000, the recommended action is: use_marketing
```

The output recommends a **different action** (`use_seo`, `use_advanced`, `use_marketing`) for each of the **traffic levels** tested (300, 1500, and 6000 visitors). Each recommendation is the action with the highest Q-value for the state that the traffic value falls into, and it can be read as the keyword focus to adopt at that traffic level. The mapping may look arbitrary at first glance, so the breakdown below spells it out.

Here’s a breakdown:

1. **Traffic Levels**:
   - **Traffic level 300**: When your website has 300 visitors.
   - **Traffic level 1500**: When your website has 1500 visitors.
   - **Traffic level 6000**: When your website has 6000 visitors.
   
   These traffic levels represent different stages of your website's growth. The model is suggesting what focus or strategy to adopt as your website grows.

2. **Recommended Actions**:
   - **use_seo**: At the early stages (300 visitors), focus on **SEO** keywords. This means working on search engine optimization strategies that help your website rank better in search engines. This could include using basic SEO techniques like keyword optimization, meta descriptions, headings, etc.
   - **use_advanced**: At a higher traffic level (1500 visitors), switch your focus to more **advanced SEO techniques**. This means going beyond basic SEO and using more sophisticated strategies like advanced keyword targeting, technical SEO, and improving the speed and user experience of your site.
   - **use_marketing**: Once you hit a substantial traffic level (6000 visitors), the focus should be on **marketing** strategies. This includes broader online marketing strategies such as social media marketing, email marketing, and paid advertising to push your website further.
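
The three traffic values above were chosen to land in the three states, but the recommendation changes only when traffic crosses a threshold. The sketch below repeats the notebook's `get_state` logic and probes values around the 1000 and 5000 boundaries. The list `probe_values` and the lookup `state_names` are assumptions added for illustration.

```python
def get_state(traffic):
    """Same thresholds as the notebook: <1000 low, <5000 medium, else high."""
    if traffic < 1000:
        return 0
    elif traffic < 5000:
        return 1
    else:
        return 2

state_names = ['low_traffic', 'medium_traffic', 'high_traffic']
probe_values = [300, 999, 1000, 1500, 4999, 5000, 6000]

results = {t: state_names[get_state(t)] for t in probe_values}
for traffic, state in results.items():
    print(f"{traffic:>5} visitors -> {state}")
```

A site with 999 visitors and one with 1000 therefore receive different recommendations, while sites with 1000 and 4999 visitors receive the same one.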

---

### **What This Output Actually Means**

Read as a whole, the output suggests a **growth-based SEO strategy**:

- When your site is still **small** (around 300 visitors), focus on **SEO** basics. This means optimizing your content for search engines so you can start getting more organic traffic from Google and Bing.
  
- As your site **grows** (around 1500 visitors), start implementing **more advanced SEO** techniques. At this stage, you should dive into more sophisticated methods like advanced link-building, technical SEO audits, or using advanced analytics tools to improve ranking.
  
- Once your site reaches a **high level of traffic** (6000+ visitors), your focus should shift towards **marketing**. This is the stage where SEO alone is not enough, and you need to expand into broader marketing strategies like social media promotion, content marketing, or PPC (pay-per-click) campaigns.

---

### **Why Is This Important?**

For a website owner, it’s important to understand that **SEO and marketing strategies evolve** as a website grows.

- **In the early stages**, focusing purely on SEO makes sense because the goal is to **get found** on search engines.
- As traffic grows, **advanced SEO** is necessary to **refine the quality** of traffic and ensure visitors find the specific content they are looking for.
- **At higher traffic levels**, marketing efforts come into play. Marketing drives more conversions, engages your audience, and scales your business beyond what SEO can do alone.

### **Explaining It to a Client**

As a website owner or consultant, here’s how you can explain this output to your client:

1. **Early Stage (Traffic Level: 300)**:
   - **Focus on SEO Basics**: At this point, your site is in the early stages of growth. The most important thing is to **get discovered** on Google. This involves optimizing your website for basic SEO: using the right keywords, writing quality content, adding meta tags, and improving your site structure.
   
   **Action to take**:
   - Implement on-page SEO tactics.
   - Make sure the site is mobile-friendly and loads quickly.
   - Use basic tools like Google Search Console to monitor your performance.

2. **Growing Stage (Traffic Level: 1500)**:
   - **Implement Advanced SEO**: Once your traffic grows to around 1500 visitors, it’s time to invest in more advanced SEO strategies. This could include **technical SEO**, improving user experience (UX), and targeting long-tail keywords that can bring in more specific search queries.
   
   **Action to take**:
   - Conduct a technical SEO audit to improve performance (page load speed, mobile usability, etc.).
   - Create content clusters or topic pillars that target related keywords.
   - Focus on building high-quality backlinks.

3. **Mature Stage (Traffic Level: 6000)**:
   - **Switch to Marketing**: At this point, your website has gained significant traction. Now, you should focus on **marketing strategies** that go beyond SEO. This involves promoting your brand on social media, running advertising campaigns, and nurturing your existing audience through newsletters or email marketing.
   
   **Action to take**:
   - Develop a strong content marketing strategy (e.g., blogs, videos, webinars).
   - Use social media marketing to drive engagement.
   - Consider paid ads (Google Ads, Facebook Ads) to scale even further.

---

### **Steps to Take After Getting This Output**

As a website owner, you can follow these steps based on the traffic levels mentioned:

1. **If your site has low traffic (around 300 visitors)**:
   - Work on basic SEO techniques to increase visibility.
   - Make sure your website is optimized for search engines by using relevant keywords in your content, title tags, and meta descriptions.
   - Improve the structure of your website and ensure it's easy for users to navigate.

2. **If your site has moderate traffic (around 1500 visitors)**:
   - Dive deeper into **advanced SEO techniques**. You can perform a technical SEO audit, optimize your site’s architecture, and improve internal linking.
   - Focus on getting **high-quality backlinks** from authoritative sites in your industry.

3. **If your site has high traffic (around 6000 visitors)**:
   - **Expand beyond SEO** and invest in broader marketing strategies. This includes content marketing, social media promotion, and paid advertising.
   - Engage with your audience through email newsletters, and leverage data-driven marketing to increase conversions.

---

### **Why Do Keywords Vary by Traffic?**

The reasoning behind this suggestion is that as your website grows, the **type of traffic changes**. Early on, people might find your site through basic SEO keywords, but as your site grows, you’ll need to target more **specific, advanced keywords** that align with the needs of a more experienced audience.

At higher traffic levels, **broad marketing strategies** will help you reach new audiences that might not come through search engines alone, such as through social media, ads, or email marketing.

---

### **Summary of What to Do Next**

1. **Start with Basic SEO**: If you’re just starting and have around 300 visitors, focus on optimizing for search engines. Use simple, effective keywords that relate directly to your content.
   
2. **Move to Advanced SEO**: Once you have a steady flow of visitors (around 1500), work on refining your SEO strategy. Start using technical SEO, targeting long-tail keywords, and improving user experience.
   
3. **Focus on Marketing**: When you’re reaching high traffic levels (6000+ visitors), switch your focus to **marketing strategies** that drive even more growth. Use content marketing, social media, and paid ads to attract new audiences and retain existing ones.

---


# Importing necessary libraries that will be used in the code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
from googlesearch import search
import random

# Download stopwords if not already available
nltk.download('stopwords')

# Load the list of stopwords in English (words that are too common and don't add much value).
stop_words = set(stopwords.words('english'))

# Adding extra words specific to our context that don't provide value for keyword extraction.
custom_stopwords = ['also', 'use', 'are', 'us', 'based', 'building', 'software', 'application', 'thatware', 'web', 'website']
stop_words.update(custom_stopwords)

# 1. Fetching the content of various web pages (the text content that users see).
urls = [
    'https://thatware.co/',  # Homepage of the website
    'https://thatware.co/services/',  # Services page
    'https://thatware.co/advanced-seo-services/',  # Advanced SEO services page
    'https://thatware.co/digital-marketing-services/',  # Digital marketing services page
    'https://thatware.co/business-intelligence-services/',  # Business intelligence services page
    'https://thatware.co/link-building-services/',  # Link building services page
    'https://thatware.co/branding-press-release-services/',  # Branding and press release services page
    'https://thatware.co/conversion-rate-optimization/',  # Conversion rate optimization page
    'https://thatware.co/social-media-marketing/',  # Social media marketing page
    'https://thatware.co/content-proofreading-services/',  # Content proofreading services page
    'https://thatware.co/website-design-services/',  # Website design services page
    'https://thatware.co/web-development-services/',  # Web development services page
    'https://thatware.co/app-development-services/',  # App development services page
    'https://thatware.co/website-maintenance-services/',  # Website maintenance services page
    'https://thatware.co/bug-testing-services/',  # Bug testing services page
    'https://thatware.co/software-development-services/',  # Software development services page
    'https://thatware.co/competitor-keyword-analysis/'  # Competitor keyword analysis page
]
# Create a list to store the content (text) of each webpage we fetch.
webpage_contents = []

# Loop through each URL in the list, fetch its content, and extract all the text.
for url in urls:
    response = requests.get(url, timeout=10)  # Give up after 10 seconds instead of hanging on an unresponsive page.
    soup = BeautifulSoup(response.text, 'html.parser')
    text = soup.get_text()  # Extract text from HTML.
    webpage_contents.append(text)  # Add the raw text of the webpage to our list.

# 2. Preprocessing: Cleaning the text data.
def clean_text(text):
    text = text.lower()  # Convert everything to lowercase.
    text = re.sub(r'\d+', '', text)  # Remove numbers.
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation.
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces.
    words = text.split()  # Split text into words.
    cleaned_words = [word for word in words if word not in stop_words]  # Remove stopwords.
    return ' '.join(cleaned_words)  # Return cleaned text.

# Apply the 'clean_text' function to all the webpage contents to clean the text.
cleaned_webpage_contents = [clean_text(content) for content in webpage_contents]

# 3. Extract keywords using TF-IDF
vectorizer = TfidfVectorizer(max_features=20)  # Keep a shared vocabulary of the 20 most frequent terms across all pages (not 20 per page).
X = vectorizer.fit_transform(cleaned_webpage_contents)
keywords_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# 4. Display the 20 vocabulary terms for each webpage, ordered by that page's TF-IDF score (highest first).
for i, url in enumerate(urls):
    print(f"\nTop Keywords for: {url}")
    top_keywords = keywords_df.iloc[i].sort_values(ascending=False).index.tolist()
    print(", ".join(top_keywords))

# 5. Keyword and Backlink Suggestions
def recommend_keywords(keyword_matrix, feature_names, n_unigrams=3, n_bigrams=5, n_trigrams=2):
    keyword_recommendations = []
    for i in range(keyword_matrix.shape[0]):
        top_keywords_idx = keyword_matrix[i].argsort()[::-1]
        top_keywords = [feature_names[idx] for idx in top_keywords_idx]
        unigrams = [kw for kw in top_keywords if len(kw.split()) == 1][:n_unigrams]
        bigrams = [kw for kw in top_keywords if len(kw.split()) == 2][:n_bigrams]
        trigrams = [kw for kw in top_keywords if len(kw.split()) == 3][:n_trigrams]
        combined_keywords = unigrams + bigrams + trigrams
        random.shuffle(combined_keywords)
        keyword_recommendations.append(combined_keywords)
    return keyword_recommendations

def recommend_backlinks(page_content, num_results=5):
    query = f"{page_content[:100]} SEO backlinks"
    backlinks = []
    for result in search(query, num=num_results, stop=num_results, pause=2):
        backlinks.append(result)
    return backlinks

vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 3))  # Unigrams, bigrams, and trigrams.
X = vectorizer.fit_transform(cleaned_webpage_contents)
feature_names = vectorizer.get_feature_names_out()

# Get keyword and backlink recommendations
keyword_recommendations = recommend_keywords(X.toarray(), feature_names)
backlink_recommendations = [recommend_backlinks(content, num_results=3) for content in cleaned_webpage_contents]

# Display the results
for i in range(len(urls)):
    url = urls[i]
    print(f"\nRecommendations for: {url}")
    print(f"Top Keywords/Phrases: {keyword_recommendations[i]}")
    print(f"Suggested Backlinks: {backlink_recommendations[i]}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Top Keywords for: https://thatware.co/
seo, services, ai, advanced, marketing, search, company, content, development, business, link, media, digital, implementation, design, social, audit, app, pages, help

Top Keywords for: https://thatware.co/services/
seo, services, marketing, development, ai, company, advanced, content, digital, media, social, design, business, link, audit, app, implementation, pages, search, help

Top Keywords for: https://thatware.co/advanced-seo-services/
implementation, seo, pages, advanced, audit, services, digital, business, marketing, search, content, development, ai, link, company, design, media, social, help, app

Top Keywords for: https://thatware.co/digital-marketing-services/
marketing, digital, seo, services, business, content, pages, media, social, search, advanced, development, company, link, design, help, ai, audit, app, implementation

Top Keywords for: https://thatware.co/business-intelligence-services/
seo, services, business, pages, help, devel

### **What the Output Means (in Simple Terms)**

This output provides two important pieces of information for each webpage of your website:

1. **Top Keywords/Phrases**: These are the terms with the highest TF-IDF scores on each page, meaning words or phrases that are prominent on that page relative to the other pages analyzed. Using them deliberately helps search engines like Google understand what each page is about and can improve your ranking in search results.
   
2. **Suggested Backlinks**: These are the top Google search results for a query built from the opening text of the page plus the phrase "SEO backlinks". Treat them as candidate sites to approach for backlinks (links from other websites to yours). Backlinks matter for SEO because links from high-quality sites can improve your website's credibility in the eyes of search engines and help boost your traffic.
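
To show how a TF-IDF ranking favors terms that are distinctive for one page, here is a small pure-Python sketch on three made-up documents. It uses the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` applies by default, but skips the per-document normalization, which does not change the ranking within a page. The toy `docs` and the helper names are assumptions for illustration only.

```python
import math
from collections import Counter

# Three tiny stand-in "pages" (hypothetical text, already cleaned).
docs = [
    "seo services seo audit",
    "marketing services social media",
    "app development services",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_scores(tokens):
    """Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

def top_terms(doc_index, k):
    """Return the k highest-scoring terms for one document (ties broken alphabetically)."""
    scores = tfidf_scores(tokenized[doc_index])
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]

print(top_terms(0, 2))
```

In this toy corpus "services" appears on every page, so it scores lowest everywhere, while "seo" and "audit" rank first on the page where they are concentrated.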

---

### **Explanation of the Output for Each Webpage**

#### 1. **Top Keywords/Phrases**:
   - **What are these?**
     - The keywords listed under this section are the words and phrases with the highest TF-IDF scores for the page: terms that appear often on that page and are distinctive compared with the other pages analyzed.
     - They are a mix of **single words** like "SEO" or "services" and **phrases** like "digital marketing" or "advanced SEO."
   
   - **What should you do with them?**
     - **Use these keywords** more strategically in your content. Ensure they are used in important parts of the page like headings, meta descriptions, and throughout the body text.
     - This will **help search engines rank your pages higher** for searches related to these topics. For example, if "SEO services" is one of the top keywords, you should make sure it's included in the page title, meta tags, and body content.

   - **Why are they important?**
     - These keywords show what your page is about. When someone searches for these terms on Google, your page will have a better chance of showing up in the search results.
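
A quick way to confirm that a keyword actually appears in the title, meta description, and headings is to parse the page's HTML. The following is a minimal sketch using only Python's standard `html.parser` module. The class `PlacementParser`, the function `keyword_placement`, and the `sample_html` snippet are hypothetical and not part of the notebook's pipeline.

```python
from html.parser import HTMLParser

class PlacementParser(HTMLParser):
    """Collect the <title>, the meta description, and h1-h3 heading text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.headings = []
        self._in_title = False
        self._heading_tag = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag in ("h1", "h2", "h3"):
            self._heading_tag = tag
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == self._heading_tag:
            self._heading_tag = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._heading_tag:
            self.headings[-1] += data

def keyword_placement(html, keyword):
    """Report where the keyword occurs (case-insensitive substring match)."""
    parser = PlacementParser()
    parser.feed(html)
    kw = keyword.lower()
    return {
        "title": kw in parser.title.lower(),
        "meta_description": kw in parser.meta_description.lower(),
        "headings": any(kw in h.lower() for h in parser.headings),
    }

# Hypothetical page used only to exercise the check.
sample_html = """<html><head><title>Advanced SEO Services</title>
<meta name="description" content="AI-driven marketing for growing sites."></head>
<body><h1>Our SEO services</h1><p>Body text.</p></body></html>"""

report = keyword_placement(sample_html, "seo services")
print(report)
```

Here the keyword is found in the title and a heading but not in the meta description, which points to the one place that still needs attention.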

#### 2. **Suggested Backlinks**:
   - **What are these?**
     - The **backlinks** listed are the search results returned for a query built from the page's opening text plus "SEO backlinks". They are candidate sites: earning a link from one of them, or citing one from your own content where it is relevant, can help your page's credibility and improve its ranking in search engines.
   
   - **What should you do with them?**
     - **Reach out to the websites** in the backlink suggestions and ask them to link to your page. Or, you can also link to them from your own content if it makes sense.
     - For example, if you are writing about "SEO services," you could link to a related article from a highly authoritative SEO blog (like one of the backlinks mentioned). This helps build trust with search engines.
   
   - **Why are they important?**
     - **Backlinks are one of the most powerful ways to improve your website's SEO.** If high-quality websites link to your site, search engines are more likely to consider your site trustworthy and rank your pages higher in the search results.

---

### **Output Example:**

#### **Recommendations for: https://thatware.co/**

1. **Top Keywords/Phrases:**
   ```
   ['seo', 'ai', 'services', 'audit implementation', 'seo services', 'advanced seo', 'social media', 'digital marketing']
   ```
   - **What it means:**
     - This means that for the homepage (https://thatware.co/), the most important words to focus on are "SEO," "AI," "services," and other terms related to your business.
     - These are the terms the tool found most prominent on the page, so they are likely the terms search engines associate with it and the terms people may use when searching for your services.
   
   - **Action Step**:
     - **Include these keywords** in the page's title, headings, and descriptions. This can improve your search ranking for these terms.

2. **Suggested Backlinks:**
   ```
   ['https://thatware.co/', 'https://thatware.co/services/', 'https://thatware.co/advanced-seo-services/']
   ```
   - **What it means:**
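     - These are some important pages within your site that you should link to from your homepage. This will help search engines understand the structure of your site better. Note that these are internal links rather than backlinks from other websites, and that the first entry is the homepage itself, which can be skipped.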
   
   - **Action Step**:
     - **Link to these pages** on your homepage. For example, if you're talking about your services, link to the "Advanced SEO Services" page from your homepage.

---

### **What You Should Do After Getting This Output**

1. **Optimize Your Content**:
   - Based on the **Top Keywords/Phrases**, make sure that these keywords are used naturally throughout your page. The most important places to use them are:
     - **Title Tag**: This is what shows up in Google search results as the title of your page.
     - **Meta Description**: This is the short description that shows up under your title in search results.
     - **Headings (H1, H2)**: These are the big, bold headers on your page.
     - **Body Text**: The content of the page itself.

   - By doing this, you'll improve your SEO, making it more likely that search engines will show your pages to people searching for these terms.

2. **Work on Backlinks**:
   - **Reach out to the suggested backlink websites** if possible. Building relationships with these sites can lead to them linking to your content, which will improve your search engine ranking.
   - **Link to these websites** within your own content. If the suggested backlinks are relevant, add them to your articles or service pages as external resources.
   
   - **Backlinks improve trust** with search engines. The more high-quality sites that link to you, the better your chances of ranking higher.

3. **Avoid Overusing Repeated Keywords**:
   - If you see certain keywords repeated across many pages (for example, "SEO" or "services"), you can still use them but also try to find **more unique keywords** that describe the specific content of each page. This will help differentiate your pages in search results.
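
One way to spot overused keywords is to count how many pages each keyword appears on and set aside those that appear on most of them. The sketch below assumes the recommendations are available as a dictionary mapping each URL to its keyword list; the function name `split_shared_and_unique`, the `max_share` threshold, and the example pages are hypothetical.

```python
from collections import Counter


def split_shared_and_unique(page_keywords, max_share=0.5):
    """Separate keywords used on more than `max_share` of pages from page-specific ones.

    page_keywords: dict mapping a page URL to its list of keywords.
    Returns (shared_keywords, unique_keywords_by_page).
    """
    n_pages = len(page_keywords)
    # Count each keyword once per page, even if a page lists it twice.
    counts = Counter(kw for kws in page_keywords.values() for kw in set(kws))
    shared = {kw for kw, count in counts.items() if count / n_pages > max_share}
    unique = {url: [kw for kw in kws if kw not in shared]
              for url, kws in page_keywords.items()}
    return shared, unique


# Illustrative data only, not real output from the tool.
pages = {
    "https://example.com/": ["seo", "services", "digital marketing"],
    "https://example.com/audit/": ["seo", "services", "site audit"],
    "https://example.com/social/": ["seo", "social media"],
}

shared, unique = split_shared_and_unique(pages)
print("Shared across most pages:", sorted(shared))
for url, keywords in unique.items():
    print(url, "->", keywords)
```

The page-specific lists are a starting point for differentiating pages; the shared keywords can stay, but they do little to distinguish one page from another.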

---


### **What are backlinks?**
Backlinks are **links from one website to another**. They are important for SEO (Search Engine Optimization) because search engines like Google treat backlinks as votes of confidence. If your website has backlinks from trusted and authoritative websites, it signals to Google that your website is trustworthy and relevant, which can help you rank higher in search results.

### **The model and the backlink suggestions:**
The model you used analyzed the content of your pages and recommended **potential backlinks** from other websites that have relevant content. For example:

- For **https://thatware.co/branding-press-release-services/**, the model suggested backlinks from websites like **Semrush** and **IndeedSEO**.
- For **https://thatware.co/conversion-rate-optimization/**, it suggested backlinks from **Thrive Agency** and **Backlinko**.

These are **external websites** that are recognized as authorities in their fields, which means getting backlinks from them could potentially help boost your SEO. The model suggests them because these websites have content related to the keywords on your page, which can help improve the SEO relevance of your website.

### **How can these backlinks help?**
If **these websites link to your website**, it tells search engines that your content is valuable. For example, if **Backlinko** or **Thrive Agency** links to your page, it can boost your page’s ranking in search results because these websites already have high authority.

### **What to do with these backlink suggestions?**
The model only generates **suggestions**; the suggested websites do not automatically link back to your website. To benefit from these backlinks, you need to **build relationships** with the webmasters or owners of these websites and ask them to include a link to your content on their pages. Here's what you need to do:

### **How to get backlinks from these websites:**

1. **Outreach**:
   - You will need to **reach out** to the owners or webmasters of these websites (for example, the marketing team at **Thrive Agency** or **Backlinko**) and ask them to link to your content. This is typically done through **email outreach** where you explain why linking to your content would be valuable to their audience.
   
   Example:
   - You can send them an email, explaining that you found their article helpful, and suggest that they add a link to your related article to offer additional value to their readers.

2. **Create valuable content**:
   - Websites are more likely to link to you if you offer high-quality, **useful content**. Ensure that the page you want them to link to provides valuable insights, data, or tools that their audience will appreciate.
   - For instance, if you want backlinks for your **SEO services page**, make sure your page has useful guides, case studies, or tools related to SEO that will make other websites want to reference your content.

3. **Guest Posting**:
   - Another method is to offer to **write a guest post** for their blog. Many websites allow guest posts in exchange for a backlink to your website. This way, you can provide valuable content to their audience and also gain a backlink.
   
4. **Content partnerships**:
   - You can propose **partnerships** where you exchange links with websites that are not direct competitors but have overlapping audiences.

5. **Link exchanges**:
   - In some cases, websites may agree to a **link exchange**: they link to you, and in return, you link to them. However, be cautious with this approach, as Google can penalize websites if it detects excessive reciprocal linking.

### **Do I need to link their website on my webpage?**
You don’t necessarily have to put links to their websites on your page unless you are participating in a **link exchange**. Generally, the focus should be on getting **backlinks from them to your page**, not the other way around. The more **relevant websites link back to your website**, the more it helps your SEO.

### **Summary:**
- **Backlinks from other websites** are an important factor in improving your website's ranking on Google.
- The model suggested **websites with relevant content** where getting a backlink would be beneficial.
- To **get these backlinks**, you will need to engage in **outreach** and build relationships with those websites.
- You can offer **valuable content, guest posts, or partnerships** to encourage these websites to link to your content.
- You don’t necessarily need to link to them from your website unless you are agreeing to a link exchange.

### **Practical Example:**
For example, for your **conversion rate optimization page**, the model suggested backlinks from websites like **Backlinko**. You could email the owner of Backlinko and say something like:

> Hi [Name],  
> I came across your fantastic guide on Conversion Rate Optimization on Backlinko. I recently wrote a detailed article on [related topic] and thought it might complement your existing content. Would you consider adding a link to my article in your guide? I think your readers would find it useful as it offers [specific benefit].  

By doing this, you are building a relationship with authoritative websites and increasing the chances of getting a valuable backlink.

---


### **What Do These Two Parts (3rd & 4th) Do?**

#### **Third Part of the Code:**
- **What it Does**:
  The third part of the code trains a **Reinforcement Learning (RL) model** using a technique called **Q-learning**. The model attempts to learn which actions (using a particular keyword or backlink) improve website traffic the most at each traffic level (low, medium, or high).
  
  - **Q-table**: It stores the values that tell the model how good certain actions (using a keyword or backlink) are for a specific traffic level.
  - **Training Process**: Over 2000 episodes, the model explores different actions and adjusts its Q-table based on the traffic changes it sees. The final output shows the "learned" Q-table, which represents the model’s understanding of which actions are most effective at different traffic levels.

- **Output**:
  The output is a **Q-table** that shows values for different actions and traffic levels. It’s essentially the "memory" of the model, showing what it has learned from the training process.
  ```
  Training completed. Final Q-table:
  [[1676.24447627 1666.02790632 1996.34255877 1624.49663865 1673.98876038 1466.64415834 1499.84684778 1804.60832792]
   [1921.560031   1983.11185917 1975.15092713 2112.46243101 1978.33027805 2001.61831645 1984.89907545 1970.24794971]
   [2012.01055189 1991.54930691 1927.24739103 2015.6133209  2123.14941818 1894.95263976 1933.28330589 1929.54493247]]
  ```
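  The Q-table helps the model determine which actions are most effective for each traffic level after training. The table has three rows and eight columns, which is consistent with one row per traffic level and one column per action; the largest value in each row marks the action the model currently prefers at that level.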
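
The training process described above can be sketched as a tabular Q-learning loop. This is a minimal illustration under stated assumptions, not the notebook's actual code: the hyperparameters, the `simulate_step` environment, and its reward values are hypothetical stand-ins for real traffic feedback, and plain Python lists are used in place of an array library.

```python
import random

N_STATES = 3    # traffic levels: 0 = low, 1 = medium, 2 = high
N_ACTIONS = 8   # one column per keyword/backlink action, as in the 3 x 8 table
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # assumed learning rate, discount, exploration rate


def simulate_step(state, action, rng):
    """Hypothetical environment: returns (reward, next_state).

    One arbitrarily chosen action per traffic level pays off more than the rest.
    """
    best_action = state + 2  # arbitrary, chosen only so the toy result is easy to check
    reward = 10.0 if action == best_action else 1.0
    next_state = rng.randrange(N_STATES)
    return reward, next_state


def train(episodes=2000, steps_per_episode=10, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES)
        for _ in range(steps_per_episode):
            # Epsilon-greedy: explore sometimes, otherwise take the best known action.
            if rng.random() < EPSILON:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])
            reward, next_state = simulate_step(state, action, rng)
            # Q-learning update: move the estimate toward reward + discounted best future value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
for row in q_table:
    print([round(value, 1) for value in row])
```

In this toy setup the highest value in each row ends up on the action that the simulated environment rewards most. With real traffic data, the reward would instead come from the measured change in visitors after an action is taken.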

#### **Fourth Part of the Code:**
- **What it Does**:
  The fourth part **uses the Q-table** (trained in the 3rd part) to make real-time recommendations. It predicts the best action (whether to use a specific keyword or backlink) based on the current traffic level (e.g., low, medium, or high traffic).

- **Output**:
  For each simulated traffic level (300, 1500, 6000 visitors), it recommends a specific action (e.g., `use_seo` or `use_marketing`). The output is:
  ```
  For traffic level 300, the recommended action is: use_seo
  For traffic level 1500, the recommended action is: use_advanced
  For traffic level 6000, the recommended action is: use_marketing
  ```

  This output shows what the model recommends to do based on the traffic the website currently has.
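
The lookup behind these recommendations can be sketched as picking the highest-valued column in the Q-table row for the current traffic level. The Q-values below are copied from the training output shown earlier; the visitor thresholds in `traffic_to_state` and the assumption that rows are ordered low, medium, high are hypothetical. Because the mapping from column index to action name is defined in the notebook's code rather than here, the sketch returns the column index.

```python
Q_TABLE = [
    [1676.24447627, 1666.02790632, 1996.34255877, 1624.49663865,
     1673.98876038, 1466.64415834, 1499.84684778, 1804.60832792],
    [1921.560031, 1983.11185917, 1975.15092713, 2112.46243101,
     1978.33027805, 2001.61831645, 1984.89907545, 1970.24794971],
    [2012.01055189, 1991.54930691, 1927.24739103, 2015.6133209,
     2123.14941818, 1894.95263976, 1933.28330589, 1929.54493247],
]


def traffic_to_state(visitors, low_max=1000, medium_max=5000):
    """Map a visitor count to a row index (thresholds are assumed, not from the notebook)."""
    if visitors < low_max:
        return 0  # low traffic
    if visitors < medium_max:
        return 1  # medium traffic
    return 2      # high traffic


def recommend_action_index(visitors, q_table=Q_TABLE):
    """Return the column index with the highest Q-value for the given traffic level."""
    row = q_table[traffic_to_state(visitors)]
    return max(range(len(row)), key=row.__getitem__)


recommendations = {v: recommend_action_index(v) for v in (300, 1500, 6000)}
print(recommendations)  # {300: 2, 1500: 3, 6000: 4}
```

The winning columns differ for each traffic level, which matches the three distinct recommendations in the output above.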

---

### **Should These Two Parts Be Included in the Final SEO Model?**

Yes, these two parts **should** be included. Here is why, and how they fit into the overall **Reinforcement Learning for SEO Model**:

1. **Third Part (Training the Q-Table)**:
   - **Why It’s Important**: This part is where the **actual learning** takes place. Without this step, the model wouldn’t know which actions are effective for different traffic levels. It’s like teaching the model how to think and make decisions based on the feedback (rewards) it receives.
   - **What Insight It Provides**: This part helps the model **learn from experience**. It simulates taking actions (like using keywords or backlinks) and adjusting its strategy based on how the traffic changes. The final Q-table is the result of this learning process, showing which actions the model believes are the most effective.
   - **In Simple Terms**: This is the part where the model "learns" the relationship between actions (keywords/backlinks) and website traffic. You need this part because without it, the model won’t know what actions work best for improving traffic.

2. **Fourth Part (Using the Q-Table to Recommend Actions)**:
   - **Why It’s Important**: This part is where the model **uses what it has learned** to make **real-time recommendations**. After the Q-table is trained, the model needs to apply its knowledge to suggest actions for current traffic levels.
   - **What Insight It Provides**: This part gives **practical recommendations** based on the current traffic the website is experiencing. For example, if your traffic is low (300 visitors), the model might recommend focusing on SEO keywords. If the traffic is higher (6000 visitors), it may recommend marketing strategies.
   - **In Simple Terms**: This is the part where the model **tells you what to do** based on your current traffic. Without it, you wouldn’t get specific actions to take based on the traffic levels you have.

---

### **Why Does the Output Seem Simple?**

I understand why you might feel the output seems too simple (like "use_seo" for low traffic and "use_marketing" for high traffic). The output is simple because, for each traffic level, the model only reports the single action with the highest learned value in the Q-table. One way to read the result is:

- **Low Traffic (300 visitors)**: The model suggests using **basic SEO techniques** (like the keyword "SEO") to improve visibility.
- **Medium Traffic (1500 visitors)**: It suggests **more advanced SEO techniques** to refine and increase traffic quality.
- **High Traffic (6000 visitors)**: It shifts the focus to **marketing strategies** to further boost and engage visitors.

### **How Does This Help a Website Owner?**

As a website owner, these recommendations can guide you in **adapting your strategy based on your traffic levels**. Here’s how you can explain it to a client:

- **If traffic is low (around 300 visitors)**, the focus should be on **basic SEO** to improve visibility on search engines. The model suggests using keywords like "SEO" to optimize content for search engines.
  
- **If traffic is medium (around 1500 visitors)**, you should start using **advanced SEO techniques**. This might involve optimizing the technical structure of the website, improving user experience, or targeting more specific, long-tail keywords.
  
- **If traffic is high (around 6000 visitors)**, the model recommends shifting the focus to **marketing strategies**. This could include content marketing, paid advertising, or social media engagement to attract even more visitors and retain existing ones.

---

### **Steps You Can Take After Getting This Output**

1. **Understand Traffic Levels**:
   - First, determine the current traffic on your website. Are you in the low, medium, or high traffic category?
   
2. **Apply Recommendations Based on Traffic**:
   - **Low Traffic**: Focus on SEO by using basic keywords like "SEO" to optimize your content for search engines.
   - **Medium Traffic**: Implement advanced SEO techniques to improve traffic quality.
   - **High Traffic**: Focus on broader marketing strategies to engage a larger audience.

3. **Monitor Results**: After applying the recommended actions, monitor how your traffic changes and continue adjusting your strategy based on what’s working.

---

### **Summary:**

- **Third Part**: This part trains the model, teaching it which actions (keywords/backlinks) improve traffic.
- **Fourth Part**: This part applies the model’s knowledge to give real-time recommendations based on current traffic.
  
Both parts are crucial for the **Reinforcement Learning for SEO Model** because the third part teaches the model, and the fourth part applies what’s learned to recommend real-world actions. You should definitely include them in the final code because they provide the foundation for how the model makes its decisions.
