### 1. SEO: Discover the Technical Challenges of Getting the Number of Indexed Pages from Google and the Sitemap

### The aim:

This Python script checks the number of pages Google has indexed for multiple sites, using Selenium and bs4 (BeautifulSoup), and compares that with the number of pages listed in each site's sitemap, collected with a sitemap parser.

The Google figure is an estimate, not an exact count. But it is good enough to get an idea of competitors' index sizes when building an SEO strategy.

You can use Python for SEO by leveraging APIs, automating boring tasks, and implementing machine learning algorithms.

Why write this script, then? Because a lot of the time you just want an idea of how committed a site is to a market: roughly how many pages does the business deal with, 200 or 200M?

https://www.jcchouinard.com/get-number-of-indexed-pages-on-multiple-sites-with-python/

### Import libraries

In [1]:
import pandas as pd
import requests

import csv
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

from bs4 import BeautifulSoup 

# Selenium

In [2]:
 
urls = [
    'weclouddata.com',
    'brainstation.io',
    'lighthouselabs.ca',
    'junocollege.com',
    'metroc.ca'
    ]
 
indexes = {}
xpath = '//*[@id="result-stats"]' 

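def get_index(url,xpath,headless=True):
    '''
    Run Selenium and return the number of indexed pages as text.
    url: full Google search URL that you want to open
    xpath: XPath of the element that holds the result count
    headless: set to False if you want to see the browser opening.
    '''
    print(f'Opening {url}')
    options = Options()
    if headless:
        # options.headless is deprecated in Selenium 4; pass the Chrome flag instead
        options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        index = driver.find_element(By.XPATH, xpath).text
    finally:
        # Always close the browser, even if the element is not found
        driver.quit()
    # The text looks like 'About 13,700 results ...'; keep only the number
    index = index.split('About ')[1].split(' results')[0]
    print(f'Index: {index}')
    return index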
 
for url in urls:
    search_url = f'https://www.google.com/search?q=site%3A{url}&oq=site%3A{url}&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8'
    index = get_index(search_url,xpath,headless=True)
    indexes[url] = index 
    time.sleep(1)
 
df_Sel = pd.DataFrame.from_dict(indexes, orient='index', columns=['indexed_pages'])
#df_Sel.to_csv('indexed_pages_selenium.csv')

Opening https://www.google.com/search?q=site%3Aweclouddata.com&oq=site%3Aweclouddata.com&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8




Index: 471
Opening https://www.google.com/search?q=site%3Abrainstation.io&oq=site%3Abrainstation.io&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8
Index: 13,700
Opening https://www.google.com/search?q=site%3Alighthouselabs.ca&oq=site%3Alighthouselabs.ca&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8
Index: 2,420
Opening https://www.google.com/search?q=site%3Ajunocollege.com&oq=site%3Ajunocollege.com&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8
Index: 502
Opening https://www.google.com/search?q=site%3Ametroc.ca&oq=site%3Ametroc.ca&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8
Index: 290


In [3]:
df_Sel

| | indexed_pages |
|---|---|
| weclouddata.com | 471 |
| brainstation.io | 13700 |
| lighthouselabs.ca | 2420 |
| junocollege.com | 502 |
| metroc.ca | 290 |
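
The script prints each count as text (for example `13,700`), which cannot be sorted or compared numerically. A minimal sketch of a converter is shown here. `parse_result_count` is a hypothetical helper name, not part of the original script, and the pattern assumes the English `About N results` wording that the script already splits on.

```python
import re

def parse_result_count(stats_text):
    """Return the result count found in stats_text as an int, or None if absent.

    Assumes English wording such as 'About 13,700 results'.
    """
    match = re.search(r'(?:About\s+)?(\d[\d,]*)\s+results?', stats_text)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group(1).replace(',', ''))

print(parse_result_count('About 13,700 results'))  # 13700
print(parse_result_count('no stats here'))         # None
```

Returning `None` instead of raising lets a loop over many sites skip a page that has no result count and carry on.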


# BeautifulSoup bs4

In [4]:

    
urls = [
    'weclouddata.com',
    'brainstation.io',
    'lighthouselabs.ca',
    'junocollege.com',
    'metroc.ca'
    ]
 
indexes = {}
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
 
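def make_request(url,headers):
    try:
        # timeout avoids hanging forever on a stalled connection
        r = requests.get(url, headers=headers, timeout=10)
    except requests.exceptions.RequestException as e: 
        raise SystemExit(e)
    return r
 
for url in urls:
    search_url = f'https://www.google.com/search?q=site%3A{url}&oq=site%3A{url}&aqs=chrome..69i57j69i58.6029j0j1&sourceid=chrome&ie=UTF-8'
    r = make_request(search_url,headers)
    soup = BeautifulSoup(r.text, "html.parser")
    stats = soup.find('div',{'id':'result-stats'})
    if stats is None:
        # Google may return a consent or CAPTCHA page without the result-stats element
        print(f'No result stats found for {url}')
        continue
    # The text looks like 'About 13,700 results ...'; keep only the number
    index = stats.text.split('About ')[1].split(' results')[0]
    indexes[url] = index 
    time.sleep(1)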
 

df_bs4 = pd.DataFrame.from_dict(indexes, orient='index', columns=['indexed_pages'])
#df_bs4.to_csv('indexed_pages_bs4.csv')

In [5]:
df_bs4

| | indexed_pages |
|---|---|
| weclouddata.com | 471 |
| brainstation.io | 13700 |
| lighthouselabs.ca | 2420 |
| junocollege.com | 502 |
| metroc.ca | 290 |
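
Both approaches build the same long search URL by hand. As a sketch, the query can also be assembled with the standard library. `build_site_search_url` is a hypothetical helper name, and the sketch assumes that the extra parameters copied from the Chrome address bar (`oq`, `aqs`, `sourceid`, `ie`) are not needed for a `site:` query.

```python
from urllib.parse import urlencode

def build_site_search_url(domain):
    """Build a Google search URL for the query 'site:<domain>'."""
    # urlencode percent-encodes the colon, giving q=site%3A<domain>
    return 'https://www.google.com/search?' + urlencode({'q': f'site:{domain}'})

print(build_site_search_url('weclouddata.com'))
# https://www.google.com/search?q=site%3Aweclouddata.com
```

Letting `urlencode` do the escaping avoids hand-written `%3A` sequences and keeps the URL valid if a domain or query ever contains other special characters.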


#  Sitemap-parser

https://github.com/ceaksan/PageContentAnalysis/tree/main/Python

In [66]:
#!pip install ultimate-sitemap-parser  # provides the usp package (sitemap_tree_for_homepage)
#!pip install pandas

In [6]:
import csv, requests
from usp.tree import sitemap_tree_for_homepage
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# weclouddata.com

In [17]:
tree = sitemap_tree_for_homepage('https://weclouddata.com')

# One row per sitemap entry: URL, last-modified date (YYYY-MM-DD) and priority
pageDetails = [[
    page.url,
    page.last_modified.date().isoformat() if page.last_modified else None,
    float(page.priority) if page.priority else None] for page in tree.all_pages()]

with open('weclouddata_pages.csv', 'w', newline='') as fl:
    write = csv.writer(fl)
    write.writerow(['URL', 'LastModified', 'Priority'])
    write.writerows(pageDetails)

print(f'{len(pageDetails)} rows found!')

2023-10-21 09:33:13,397 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://weclouddata.com/robots.txt...
2023-10-21 09:33:13,398 INFO usp.helpers [23980/MainThread]: Fetching URL https://weclouddata.com/robots.txt...
2023-10-21 09:33:13,945 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://weclouddata.com/robots.txt...
2023-10-21 09:33:13,946 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://weclouddata.com/sitemap_index.xml...
2023-10-21 09:33:13,946 INFO usp.helpers [23980/MainThread]: Fetching URL https://weclouddata.com/sitemap_index.xml...
2023-10-21 09:33:33,216 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://weclouddata.com/sitemap_index.xml...
2023-10-21 09:33:33,217 INFO usp.fetch_parse [23980/MainThread]: Fetching level 1 sitemap from https://weclouddata.com/post-sitemap.xml...
2023-10-21 09:33:33,218 INFO usp.helpers [23980/MainThread]: Fetching URL https://weclouddata.

1008 rows found!


In [18]:
df_wcd = pd.read_csv('weclouddata_pages.csv')
df_wcd

| | URL | LastModified | Priority |
|---|---|---|---|
| 0 | https://weclouddata.com/blog/ | 2023-09-14 | 0.5 |
| 1 | https://weclouddata.com/uncategorized/big-data... | 2019-11-09 | 0.5 |
| 2 | https://weclouddata.com/blog/consulting/consul... | 2021-10-20 | 0.5 |
| 3 | https://weclouddata.com/blog/consulting/consul... | 2021-10-20 | 0.5 |
| 4 | https://weclouddata.com/blog/consulting/consul... | 2021-10-20 | 0.5 |
| ... | ... | ... | ... |
| 1003 | https://weclouddata.com/location/vancouver/ | 2023-09-13 | 0.5 |
| 1004 | https://weclouddata.com/pace/full-time/ | 2023-09-13 | 0.5 |
| 1005 | https://weclouddata.com/pace/part-time/ | 2023-10-11 | 0.5 |
| 1006 | https://weclouddata.com/pricing_currency/cad/ | 2023-08-09 | 0.5 |


In [11]:
# https://pythontic.com/pandas/dataframe-attributes/introduction
print(df_wcd.shape)
print(df_wcd.columns)
# describe() and info() are methods, so they need parentheses
print(df_wcd.describe())
df_wcd.info()  # info() prints its report itself

(1008, 3)
Index(['URL', 'LastModified', 'Priority'], dtype='object')

# brainstation.io

In [19]:
tree = sitemap_tree_for_homepage('https://brainstation.io')

# One row per sitemap entry: URL, last-modified date (YYYY-MM-DD) and priority
pageDetails = [[
    page.url,
    page.last_modified.date().isoformat() if page.last_modified else None,
    float(page.priority) if page.priority else None] for page in tree.all_pages()]

with open('brainstation_pages.csv', 'w', newline='') as fl:
    write = csv.writer(fl)
    write.writerow(['URL', 'LastModified', 'Priority'])
    write.writerows(pageDetails)

print(f'{len(pageDetails)} rows found!')

2023-10-21 09:34:50,480 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://brainstation.io/robots.txt...
2023-10-21 09:34:50,480 INFO usp.helpers [23980/MainThread]: Fetching URL https://brainstation.io/robots.txt...
2023-10-21 09:34:51,037 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://brainstation.io/robots.txt...
2023-10-21 09:34:51,039 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://brainstation.io/sitemap.xml...
2023-10-21 09:34:51,040 INFO usp.helpers [23980/MainThread]: Fetching URL https://brainstation.io/sitemap.xml...
2023-10-21 09:34:51,956 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://brainstation.io/sitemap.xml...
2023-10-21 09:34:51,961 INFO usp.fetch_parse [23980/MainThread]: Fetching level 1 sitemap from https://brainstation.io/sitemap-main.xml...
2023-10-21 09:34:51,961 INFO usp.helpers [23980/MainThread]: Fetching URL https://brainstation.io/sitemap-main.x

11298 rows found!


In [20]:
df_brs = pd.read_csv('brainstation_pages.csv')
df_brs

| | URL | LastModified | Priority |
|---|---|---|---|
| 0 | https://brainstation.io/ | | 1.0 |
| 1 | https://brainstation.io/about | | 0.8 |
| 2 | https://brainstation.io/business/online | | 0.8 |
| 3 | https://brainstation.io/careers | | 1.0 |
| 4 | https://brainstation.io/careers/teach | | 1.0 |
| ... | ... | ... | ... |
| 11293 | https://brainstation.io/magazine/author/rebecc... | 2023-02-27 | 0.3 |
| 11294 | https://brainstation.io/magazine/author/justin... | 2021-07-27 | 0.3 |
| 11295 | https://brainstation.io/magazine/author/josh-c... | 2022-04-22 | 0.3 |
| 11296 | https://brainstation.io/magazine/author/nick-p... | 2022-09-28 | 0.3 |


# lighthouselabs.ca

In [21]:
tree = sitemap_tree_for_homepage('https://lighthouselabs.ca')

# One row per sitemap entry: URL, last-modified date (YYYY-MM-DD) and priority
pageDetails = [[
    page.url,
    page.last_modified.date().isoformat() if page.last_modified else None,
    float(page.priority) if page.priority else None] for page in tree.all_pages()]

with open('lighthouselabs_pages.csv', 'w', newline='') as fl:
    write = csv.writer(fl)
    write.writerow(['URL', 'LastModified', 'Priority'])
    write.writerows(pageDetails)

print(f'{len(pageDetails)} rows found!')

2023-10-21 09:40:51,331 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://lighthouselabs.ca/robots.txt...
2023-10-21 09:40:51,332 INFO usp.helpers [23980/MainThread]: Fetching URL https://lighthouselabs.ca/robots.txt...
2023-10-21 09:40:52,380 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://lighthouselabs.ca/robots.txt...
2023-10-21 09:40:52,381 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://www.lighthouselabs.ca/sitemap.xml...
2023-10-21 09:40:52,382 INFO usp.helpers [23980/MainThread]: Fetching URL https://www.lighthouselabs.ca/sitemap.xml...
2023-10-21 09:40:52,908 INFO usp.fetch_parse [23980/MainThread]: Parsing sitemap from URL https://www.lighthouselabs.ca/sitemap.xml...
2023-10-21 09:40:52,950 INFO usp.fetch_parse [23980/MainThread]: Fetching level 0 sitemap from https://lighthouselabs.ca/sitemap...
2023-10-21 09:40:52,951 INFO usp.helpers [23980/MainThread]: Fetching URL https://lighthouselab

1118 rows found!


In [22]:
df_lh = pd.read_csv('lighthouselabs_pages.csv')
df_lh

| | URL | LastModified | Priority |
|---|---|---|---|
| 0 | https://www.lighthouselabs.ca | 2023-10-18 | 1.0 |
| 1 | https://www.lighthouselabs.ca/en/web-developme... | 2023-10-18 | 0.9 |
| 2 | https://www.lighthouselabs.ca/en/data-science-... | 2023-10-18 | 0.9 |
| 3 | https://www.lighthouselabs.ca/en/intro-web-dev... | 2023-10-18 | 0.9 |
| 4 | https://www.lighthouselabs.ca/en/intro-front-e... | 2023-10-18 | 0.9 |
| ... | ... | ... | ... |
| 1113 | https://www.lighthouselabs.ca/fr/partenariats-... | 2023-10-18 | 0.5 |
| 1114 | https://www.lighthouselabs.ca/fr/faq | 2023-10-18 | 0.7 |
| 1115 | https://www.lighthouselabs.ca/fr/demande | 2023-10-18 | 0.9 |
| 1116 | https://www.lighthouselabs.ca/fr/communaute | 2023-10-18 | 0.6 |
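
The three sitemap cells above repeat the same steps with a different site name. One way to factor that out is sketched below. `pages_to_rows` and `write_rows` are hypothetical helper names, and the sample pages are stand-in objects that only mimic the `url`, `last_modified` and `priority` attributes used above, so the sketch runs without network access.

```python
import csv
from datetime import datetime
from types import SimpleNamespace

def pages_to_rows(pages):
    """Turn sitemap page objects into [url, last_modified_date, priority] rows."""
    rows = []
    for page in pages:
        modified = page.last_modified.date().isoformat() if page.last_modified else None
        priority = float(page.priority) if page.priority else None
        rows.append([page.url, modified, priority])
    return rows

def write_rows(rows, filename):
    """Write the rows to a CSV file with the same header used above."""
    with open(filename, 'w', newline='') as fl:
        writer = csv.writer(fl)
        writer.writerow(['URL', 'LastModified', 'Priority'])
        writer.writerows(rows)

# Stand-in objects for illustration only (example.com is not one of the sites above).
sample_pages = [
    SimpleNamespace(url='https://example.com/',
                    last_modified=datetime(2023, 10, 18, 9, 30), priority='1.0'),
    SimpleNamespace(url='https://example.com/about',
                    last_modified=None, priority=None),
]
rows = pages_to_rows(sample_pages)
print(rows)
# [['https://example.com/', '2023-10-18', 1.0], ['https://example.com/about', None, None]]
```

With the real parser, the same helpers would presumably be called as `write_rows(pages_to_rows(tree.all_pages()), 'weclouddata_pages.csv')` inside a loop over the site list.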


# Notes:
- The number of pages scraped from the weclouddata.com sitemap is 1008, whereas Google reports about 471 in the run shown above.
- The number of pages scraped from the brainstation.io sitemap is 11298, whereas Google reports at most about 13,700.
- The number of pages scraped from the lighthouselabs.ca sitemap is 1118, whereas Google reports at most about 2,420.
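
To make the three comparisons easier to read side by side, the figures can be turned into a ratio. This is only a sketch using the numbers quoted in the notes. A ratio below 1 means Google reports fewer pages than the sitemap lists, and a ratio above 1 means it reports more.

```python
# Counts copied from the notes: pages listed in the sitemap vs Google's estimate.
counts = {
    'weclouddata.com':   {'sitemap': 1008,  'google': 471},
    'brainstation.io':   {'sitemap': 11298, 'google': 13700},
    'lighthouselabs.ca': {'sitemap': 1118,  'google': 2420},
}

# Google's estimate divided by the sitemap count, per site
ratios = {site: c['google'] / c['sitemap'] for site, c in counts.items()}

for site, ratio in ratios.items():
    print(f'{site}: Google reports {ratio:.2f}x the sitemap count')
```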

Running the code several times gives a different number of indexed pages each time.
- brainstation.io and lighthouselabs.ca almost hit their highest numbers on the first run.
- weclouddata.com and the others are unstable: weclouddata.com starts low, around 471, and increases to 859 after many runs.
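
Because a single reading can be misleading, one option is to record several readings per site and report their spread. The sketch below assumes the readings have already been collected as integers (for example by calling `get_index` a few times with a pause in between). `summarize_runs` is a hypothetical helper name.

```python
from statistics import median

def summarize_runs(readings):
    """Summarize repeated index-count readings for one site."""
    return {'min': min(readings), 'max': max(readings), 'median': median(readings)}

# 471 and 859 are the two weclouddata.com values mentioned in the notes.
print(summarize_runs([471, 859]))
# {'min': 471, 'max': 859, 'median': 665.0}
```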