# Schema analysis in Bulk


The primary goal is to write great content annotated with schema markup

- Remember your customers and quality content come first

- Use schematic markup to convey information about the pages


### About Alton

Follow me for more data and tutorials

- twitter: https://twitter.com/alton_lex @alton_lex

- linkedin: https://www.linkedin.com/in/altonalexander/


### About Data Winners

Join the conversation:

- private Discord community

- Video tutorials

- Feedback and support on this and other scripts

Join now: https://datawinners.gumroad.com/l/data-analytics-for-seo


# Setup Environment

In [3]:
%pip install extruct

Defaulting to user installation because normal site-packages is not writeable
Collecting extruct
  Downloading extruct-0.14.0-py2.py3-none-any.whl (25 kB)
Collecting pyrdfa3
  Using cached pyRdfa3-3.5.3-py3-none-any.whl (121 kB)
Collecting jstyleson
  Using cached jstyleson-0.0.2-py3-none-any.whl
Collecting html-text>=0.5.1
  Using cached html_text-0.5.2-py2.py3-none-any.whl (7.5 kB)
Collecting mf2py
  Using cached mf2py-1.1.2-py3-none-any.whl
Collecting rdflib>=6.0.0
  Using cached rdflib-6.2.0-py3-none-any.whl (500 kB)
Collecting isodate
  Using cached isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: jstyleson, isodate, html-text, rdflib, mf2py, pyrdfa3, extruct
Successfully installed extruct-0.14.0 html-text-0.5.2 isodate-0.6.1 jstyleson-0.0.2 mf2py-1.1.2 pyrdfa3-3.5.3 rdflib-6.2.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m]

In [4]:
# setup libraries

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse
from pprint import pprint

# Extract schema from one page

In [138]:
def extract_schema(url):
    """Extract all metadata present in the page and return a dictionary of metadata lists. 
    
    Args:
        url (string): URL of page from which to extract metadata. 
    
    Returns: 
        metadata (dict): Dictionary of json-ld, microdata, and opengraph lists. 
        Each of the lists present within the dictionary contains multiple dictionaries.
    """
    
    try:
        r = requests.get(url)
        base_url = get_base_url(r.text, r.url)
        metadata = extruct.extract(r.text, 
                                   base_url=base_url,
                                   uniform=True,
                                   syntaxes=['json-ld',
                                             'microdata',
                                             'opengraph'])
        return metadata
    except:
        return {}

In [139]:
# test on one domain

url = "https://bikelanes.com/seattle"

schema_json = extract_schema( url )

schema_json

{'microdata': [{'value': '',
   '@type': 'WebPageElement',
   '@context': 'https://schema.org'}],
 'json-ld': [{'@context': 'https://schema.org',
   '@graph': [{'@type': 'WebSite',
     '@id': 'https://bikelanes.com#website',
     'url': 'https://bikelanes.com',
     'name': 'Bike Lanes',
     'inLanguage': 'en-US'},
    {'@type': 'WebPage',
     '@id': 'https://bikelanes.com/seattle/#webpage',
     'url': 'https://bikelanes.com/seattle/',
     'name': 'Bike lanes in Seattle',
     'mainContentOfPage': {'@type': 'WebPageElement',
      'xpath': "//*[@id='mainContentOfPage']"},
     'datePublished': '2021-03-27',
     'dateModified': '2022-01-08',
     'keywords': '{location} bike lanes,bike routes in {location},{location} bike routes,bike routes near me,bike lanes near me,bike lanes in {location}',
     'isPartOf': {'@id': 'https://bikelanes.com#website'},
     'breadcrumb': {'@id': 'https://bikelanes.com/seattle/#breadcrumb'},
     'inLanguage': 'en-US',
     'potentialAction': [{'@ty

# Get just one schema item

In [89]:
def get_dictionary_by_key_value(dictionary, target_key, target_value):
    """Return a dictionary that contains a target key value pair. 
    
    Args:
        dictionary: Metadata dictionary containing lists of other dictionaries.
        target_key: Target key to search for within a dictionary inside a list. 
        target_value: Target value to search for within a dictionary inside a list. 
    
    Returns:
        target_dictionary: Target dictionary that contains target key value pair. 
    """
    
    if isinstance(dictionary, list):
        for eachitem in dictionary:
            found = get_dictionary_by_key_value(eachitem, target_key, target_value)
            if found:
                return found
    
    if not isinstance( dictionary, dict):
        return None
    
    # check if it is at first level?
    if dictionary.get(target_key) == target_value:
        return dictionary
        
    for key in dictionary:
        if len(dictionary[key]) > 0:
            
            if isinstance(dictionary[key], list):
                for item in dictionary[key]:
                    
                    if isinstance(item, list):
                        for eachitem in item:
                            get_dictionary_by_key_value(eachitem, target_key, target_value)

    
                    if isinstance( item, dict):
                        if item.get(target_key) == target_value:
                            return item


                        # crawl deeper
                        for item2 in item:

                            nested = item[item2]

                            found = get_dictionary_by_key_value(nested, target_key, target_value)
                            if found:
                                return found
                            
                            if isinstance( item2, list):
                                print("#### list", item)
                    
    return None

In [91]:
# test

BreadcrumbList = get_dictionary_by_key_value(schema_json, "@type", "BreadcrumbList")
BreadcrumbList

{'@type': 'BreadcrumbList',
 '@id': 'https://bikelanes.com/seattle/#breadcrumb',
 'itemListElement': [{'@type': 'ListItem',
   'position': 1,
   'item': {'@type': 'WebSite',
    '@id': 'https://bikelanes.com#website',
    'url': 'https://bikelanes.com',
    'name': 'Bike Lanes'}},
  {'@type': 'ListItem',
   'position': 2,
   'item': {'@type': 'WebPage',
    '@id': 'https://bikelanes.com/seattle/#webpage',
    'url': 'https://bikelanes.com/seattle/',
    'name': 'Bike lanes in Seattle',
    'hasPart': {'@id': 'https://bikelanes.com/seattle/#article',
     '@type': 'Article'}}}]}

# Find all types

In [120]:
# find all types

def find_all(dictionary, target_key, targets_found=[]):
    """
    Return a list of values for the target key 
    """
    
    if isinstance(dictionary, list):
        for eachitem in dictionary:
            found = find_all(eachitem, target_key)
            if found:
                targets_found.extend(found)
    
    if not isinstance( dictionary, dict):
        return None
    
    # check if it is at first level?
    if dictionary.get(target_key):
        return [dictionary.get(target_key)]
        
    for key in dictionary:
        if len(dictionary[key]) > 0:
            
            if isinstance(dictionary[key], list):
                for item in dictionary[key]:
                    
                    if isinstance(item, list):
                        for eachitem in item:
                            find_all(eachitem, target_key)

    
                    if isinstance( item, dict):
                        if item.get(target_key):
                            targets_found.extend([item.get(target_key)])


                        # crawl deeper
                        for item2 in item:

                            nested = item[item2]

                            found = find_all(nested, target_key)
                            if found:
                                targets_found.extend(found)
                            
                            if isinstance( item2, list):
                                print("#### list", item)
                    
    return list(set(targets_found))

found = find_all(schema_json, "@type")
found

['BreadcrumbList',
 'WebSite',
 'WebPageElement',
 'WebPage',
 'Organization',
 'Article',
 'article']

# Bulk extract schema from serp

In [121]:
# get the libraries
!pip install googlesearch-python

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [122]:
import requests
from googlesearch import search
from bs4 import BeautifulSoup

In [127]:
# run the query

query = "bike lanes"
results_generator = search(query, num_results=10, lang="en")


In [128]:

# loop over the generator and save to a dictionary
results = {}

i = 0
for link in list(results_generator):
    print(link)
    if link not in results:
        # add link the first time
        results[ link ] = {
            "url": link,
            "rank": i,
            "q": query
        }
    i = i+1

results

https://nacto.org/publication/urban-bikeway-design-guide/bike-lanes/
https://www.bikeleague.org/content/bike-lanes
https://orem.org/wp-content/uploads/2020/05/CityTrailsAndBikeLanes.pdf
https://maps.provo.org/downloads/Trails_Map.pdf
https://www.peopleforbikes.org/reports/protected-bikes-lanes-101
https://www.stronggo.com/blog/6-benefits-bike-lanes
https://www.shmoop.com/drivers-ed/utah/bicycle-lanes.html
https://portal.ct.gov/-/media/DOT/PLNG_PLANS/BikePedPlan/driver_pamphlet_pi.pdf
https://www.peopleforbikes.org/reports/protected-bikes-lanes-101
https://maps.provo.org/downloads/Trails_Map.pdf
https://portal.ct.gov/-/media/DOT/PLNG_PLANS/BikePedPlan/driver_pamphlet_pi.pdf


{'https://nacto.org/publication/urban-bikeway-design-guide/bike-lanes/': {'url': 'https://nacto.org/publication/urban-bikeway-design-guide/bike-lanes/',
  'rank': 0,
  'q': 'bike lanes'},
 'https://www.bikeleague.org/content/bike-lanes': {'url': 'https://www.bikeleague.org/content/bike-lanes',
  'rank': 1,
  'q': 'bike lanes'},
 'https://orem.org/wp-content/uploads/2020/05/CityTrailsAndBikeLanes.pdf': {'url': 'https://orem.org/wp-content/uploads/2020/05/CityTrailsAndBikeLanes.pdf',
  'rank': 2,
  'q': 'bike lanes'},
 'https://maps.provo.org/downloads/Trails_Map.pdf': {'url': 'https://maps.provo.org/downloads/Trails_Map.pdf',
  'rank': 3,
  'q': 'bike lanes'},
 'https://www.peopleforbikes.org/reports/protected-bikes-lanes-101': {'url': 'https://www.peopleforbikes.org/reports/protected-bikes-lanes-101',
  'rank': 4,
  'q': 'bike lanes'},
 'https://www.stronggo.com/blog/6-benefits-bike-lanes': {'url': 'https://www.stronggo.com/blog/6-benefits-bike-lanes',
  'rank': 5,
  'q': 'bike lanes'}

In [147]:
# get the schema type for each one

for url in results:
    schema_json = extract_schema( url )
    results[url]['schema_types'] = find_all(schema_json, "@type", [])
    pprint(results[url])
    print()


{'q': 'bike lanes',
 'rank': 0,
 'schema_types': ['WebSite',
                  'BreadcrumbList',
                  'WebPageElement',
                  'ListItem',
                  'WebPage',
                  'Organization',
                  'Article',
                  'CreativeWork',
                  'article'],
 'url': 'https://nacto.org/publication/urban-bikeway-design-guide/bike-lanes/'}

{'q': 'bike lanes',
 'rank': 1,
 'schema_types': ['WebSite',
                  'BreadcrumbList',
                  'ListItem',
                  'WebPageElement',
                  'WebPage',
                  'Organization',
                  'Article',
                  'CreativeWork',
                  'article'],
 'url': 'https://www.bikeleague.org/content/bike-lanes'}

{'q': 'bike lanes',
 'rank': 2,
 'schema_types': [],
 'url': 'https://orem.org/wp-content/uploads/2020/05/CityTrailsAndBikeLanes.pdf'}

{'q': 'bike lanes',
 'rank': 3,
 'schema_types': [],
 'url': 'https://maps.provo.org/do