## PageRank and CheiRank Calculated from Screaming Frog Crawl Data and Moz API Data

The following workflow requires two files.

* **internal_html.csv**: An export of Internal HTML from Screaming Frog. The Moz API must be enabled (requires an API key). Select `URL` > `MozRank External Equity` under `API Access` > `Moz` > `Metrics` in Screaming Frog.

* **all_inlinks.csv**: A bulk export from `Bulk Export` > `All Inlinks` in Screaming Frog.

Both files are raw exports, so the column names are the Screaming Frog defaults. For Screaming Frog 12.4 and earlier, `read_csv` must skip the first row (`skiprows=1`); for Screaming Frog 12.5 and later, it must not (`skiprows=0`).
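
If you would rather not hard-code `skiprows`, one option is to peek at the first row of the export. The sketch below assumes that the real header row always contains an `Address` column (as in `internal_html.csv`; for `all_inlinks.csv` you would look for `Source` instead). `detect_skiprows` is a hypothetical helper, not part of pandas or Screaming Frog, and the sample file contents are made up.

```python
import csv
import io

def detect_skiprows(lines, header_column="Address"):
    """Return 0 if the first row is already the header, else 1.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    first_row = next(csv.reader(lines), [])
    return 0 if header_column in first_row else 1

# Simulated exports (hypothetical content, for illustration only).
new_style = io.StringIO('"Address","Status Code"\n"https://domain.com/","200"\n')
old_style = io.StringIO('"Internal - HTML"\n"Address","Status Code"\n"https://domain.com/","200"\n')

print(detect_skiprows(new_style))  # 0
print(detect_skiprows(old_style))  # 1
```

With a real file you would open it first (`open('internal_html.csv', newline='', encoding='utf-8')`) and pass the result to `pd.read_csv(..., skiprows=...)`.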

This is a follow-up to this [tweet](https://twitter.com/willem_nout/status/1101417508685467648).

Follow me, [JR Oakes](https://twitter.com/jroakes), on Twitter for more Technical SEO goodness.

### Install These Libraries

Run this cell only if you don't already have these libraries installed; otherwise, skip it.

In [None]:
!pip install networkx
!pip install pandas
!pip install tqdm

### Import Libraries

In [26]:
import networkx as nx
import pandas as pd
import re
from tqdm import tqdm

### Specify Some Variables

We need to specify some variables we will use later.

In [45]:
# Base domain for your property
domain = "domain.com"

# Specify the output filename base.  Auto-generated from domain if `None`.
filename = None

### Functions Which Consolidate URLs
1. Consolidates URLs to canonical versions based on canonical link element.
1. Provides a mapping dictionary for 30X URLs, from the initial 301/302 to the final 200 (canonical) URL.

In [3]:

def apply_mapping(row, mapping):
    if row['Destination'] in mapping:
        row['Destination'] = mapping[row['Destination']]
    if row['Source'] in mapping:
        row['Source'] = mapping[row['Source']]
    return row


def consolidate_urls(df_html):
    
    mappings = {}
    
    # Consolidate canonicals
    good_statuses = [200]
    rows = []
    
    df_html_good = df_html[df_html['Status Code'].isin(good_statuses)]
    
    for i, row in tqdm(df_html_good.iterrows(), total=df_html_good.shape[0]):
        
        canonical = str(row['Canonical Link Element 1'])
        
        if "/" in canonical and canonical != row['Address']:
            mappings[row['Address']] = canonical
            row['Address'] = canonical
        else:
            mappings[row['Address']] = row['Address']
            
        rows.append(row)
    
    # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once.
    df_html_200 = pd.DataFrame(rows)
    
    df_html_200 = df_html_200.groupby(['Address'], as_index=False).agg({'Moz External Equity Links - Exact': 'sum', 'Outlinks':'max'})
    
    # Create mapping for redirects
    redirect_statuses = [301,302]
    df_html_redir = df_html[df_html['Status Code'].isin(redirect_statuses)]
    
    addresslist = df_html_redir['Address'].tolist()
    redirlist =  df_html_redir['Redirect URL'].tolist()
    
    for i, address in tqdm(enumerate(addresslist)):
 
        redir = redirlist[i]
        
        if redir in mappings:
            mappings[address] = mappings[redir]
        else:
            for _ in range(5):
                if redir in addresslist:
                    redir = redirlist[addresslist.index(redir)]
                    if redir in mappings:
                        mappings[address] = mappings[redir]
                        break
                        
                        
    return df_html_200, mappings
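
To see what the redirect mapping is doing, here is a simplified, self-contained sketch of the same idea using plain dictionaries. The URLs are made up, and `resolve_redirects` is a hypothetical helper rather than the function above: it follows each redirect for a bounded number of hops until it reaches a URL that already has a canonical mapping.

```python
def resolve_redirects(redirects, mappings, max_hops=5):
    """Map each redirecting URL to its final canonical URL, if reachable.

    redirects: {redirecting URL: redirect target}
    mappings:  {200 URL: canonical URL}
    """
    resolved = dict(mappings)
    for address, target in redirects.items():
        for _ in range(max_hops + 1):
            if target in mappings:
                resolved[address] = mappings[target]
                break
            if target not in redirects:
                break  # chain leaves the crawl; leave the URL unmapped
            target = redirects[target]
    return resolved

# Hypothetical crawl data.
mappings = {"https://domain.com/new": "https://domain.com/new"}
redirects = {
    "http://domain.com/old": "https://domain.com/old",   # hop 1
    "https://domain.com/old": "https://domain.com/new",  # hop 2
    "https://domain.com/gone": "https://other.com/",     # never resolves
}

resolved = resolve_redirects(redirects, mappings)
print(resolved["http://domain.com/old"])      # https://domain.com/new
print("https://domain.com/gone" in resolved)  # False
```

The hop limit keeps redirect loops from running forever; URLs that never reach a crawled 200 page simply stay unmapped.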
        

### Read Crawl HTML Data

Crawl with Screaming Frog and export the Internal HTML report. Configure the crawl to respect robots.txt, noindex, canonical, and similar directives so that the crawl represents what Google sees as closely as possible.

**Warning**: Make sure that `URL` > `MozRank External Equity` is selected in `API Access` > `Moz` > `Metrics` of Screaming Frog (requires API key) 

**Expects**: `internal_html.csv` file from Screaming Frog

In [None]:
df_html = pd.read_csv('internal_html.csv', skiprows=0)

# Grab 200 urls and canonicalize
df_html, mappings = consolidate_urls(df_html)

df_html = df_html[['Address','Moz External Equity Links - Exact', 'Outlinks']]
df_html.columns = ['Address', 'Equity', 'Outlinks']
df_html.head()

### Read Internal Link Data

After the prior crawl, use the Bulk Export tool to export All Inlinks. We then clean the data a bit to ensure we have only the links that we want.

**Expects**: `all_inlinks.csv` file from Screaming Frog

In [None]:
df_links = pd.read_csv('all_inlinks.csv', skiprows=0, low_memory=False)

# keep only Ahref and Follow
df_links = df_links[(df_links['Type'] == "AHREF") & (df_links['Follow'] == True)]

# keep only internal links
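# Escape the domain so "." only matches a literal dot; na=False treats missing URLs as non-matches
internal_pattern = r'^(http:|https:)//(www\.)?{}.*$'.format(re.escape(domain))
df_links = df_links[(df_links['Destination'].str.match(internal_pattern, case=False, na=False)) & (df_links['Source'].str.match(internal_pattern, case=False, na=False))]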

# Map links to their final destination
df_links = df_links.apply(apply_mapping, axis=1, args=(mappings,))

# Keep only the columns we need
df_links = df_links[['Source','Destination']]

df_links.head()
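
The internal-link filter can be tried out on its own with the standard library. This is a sketch under a few assumptions: `is_internal` is a hypothetical helper, the domain is escaped with `re.escape`, and the pattern is deliberately stricter than the one above in that the host must end right after the domain. Subdomains other than `www`, and URLs with an explicit port, are treated as external here.

```python
import re

domain = "domain.com"

# Escape the domain so that "." matches a literal dot only.
internal_re = re.compile(
    r'^https?://(www\.)?{}(/|$)'.format(re.escape(domain)), re.IGNORECASE
)

def is_internal(url):
    """True if `url` is on the bare or www host of `domain`."""
    return bool(internal_re.match(url))

print(is_internal("https://www.domain.com/page"))   # True
print(is_internal("http://DOMAIN.com"))             # True
print(is_internal("https://domainXcom/page"))       # False
print(is_internal("https://domain.com.evil.net/"))  # False
```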

### Clean the Links in Both Datasets
Converts the URLs to paths and removes trailing slashes.

**Warning**: This is not really needed as the consolidation done earlier is more reflective of how Google handles URLs. 

In [None]:
from urllib.parse import urlparse

def remove_trail_slash(s):
    if s.endswith('/'):
        s = s[:-1]
    return s

# This may or may not be what you want, depending on the site. It can be extended to keep important query strings or consolidate canonicals.
def apply_clean_links(row):
    
    cols=['Address', 'Source', 'Destination']
    
    for c in cols:
        if c in row:
            row[c] = remove_trail_slash(urlparse(row[c]).path)
        
    return row


df_links = df_links.apply(apply_clean_links, axis=1)
df_html = df_html.apply(apply_clean_links, axis=1)

#Consolidate External Equity
df_html = df_html.groupby([ 'Address'], as_index=False).agg({"Equity": "max","Outlinks":"max"})


df_html.head()
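
A quick way to check whether this cleaning suits your site is to run a few representative URLs through it. The sketch below uses a hypothetical helper, `to_clean_path`, that does the same thing as `remove_trail_slash(urlparse(...).path)`; the URLs are made up. Note that query strings are dropped and the homepage collapses to an empty string, so pages that differ only by query string are merged.

```python
from urllib.parse import urlparse

def to_clean_path(url):
    """Return the URL path without its trailing slash."""
    path = urlparse(url).path
    return path[:-1] if path.endswith('/') else path

print(to_clean_path("https://domain.com/blog/post/"))     # /blog/post
print(to_clean_path("https://domain.com/blog/post?p=2"))  # /blog/post
print(repr(to_clean_path("https://domain.com/")))         # ''
```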

### Set Up the Graphs
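This sets up the two directed graphs. The PageRank (PR) graph has an edge from each link's `Source` to its `Destination`; the CheiRank (CR) graph has the same edges reversed. Links whose source or destination is not in the set of crawled pages are skipped.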

In [6]:
def traverse_dataframe(df, addresses, graph, gtype= "PR"):
    
    for i, row in df.iterrows():
        
        # Add nodes
        if 'Address' in row:
            if not graph.has_node(row['Address']):
                graph.add_node(row['Address'])
        # Add edges
        elif 'Destination' in row and 'Source' in row:
            
            #Skip adding edge if source or destination is not in set of pages.
            if row['Destination'] in addresses and row['Source'] in addresses:
                if gtype == 'PR':
                    graph.add_edge(row['Source'], row['Destination'])
                else:
                    graph.add_edge(row['Destination'], row['Source'])
            
        else:
            raise Exception('The correct dataframes were not supplied.  Expecting either `Address` or `Destination` and `Source` columns.')
            
            

def run_graphs(df_links, df_html):

    pr_graph = nx.DiGraph()
    cr_graph = nx.DiGraph()
    
    # A set makes the per-link membership checks fast on large crawls
    addresses = set(df_html['Address'].tolist())
    
    # Pagerank Graph
    traverse_dataframe(df_html, addresses, pr_graph, gtype= "PR")
    traverse_dataframe(df_links, addresses, pr_graph, gtype= "PR")
                  
    # CheiRank Graph
    traverse_dataframe(df_html, addresses, cr_graph, gtype= "CR")
    traverse_dataframe(df_links, addresses, cr_graph, gtype= "CR")
    
    
    return pr_graph, cr_graph
    
    

pr_graph, cr_graph = run_graphs(df_links, df_html)
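
The relationship between the two graphs is easy to see on a toy example. This is a minimal sketch with plain dictionaries instead of networkx; `build_adjacency` is a hypothetical helper and the pages and links are made up.

```python
def build_adjacency(edges, pages, reverse=False):
    """Adjacency sets for the edges whose two ends are both in `pages`."""
    graph = {p: set() for p in pages}
    for source, destination in edges:
        if source in graph and destination in graph:
            if reverse:
                source, destination = destination, source
            graph[source].add(destination)
    return graph

pages = ["/", "/a", "/b"]
edges = [("/", "/a"), ("/", "/b"), ("/a", "/"), ("/a", "/external")]

pr_adj = build_adjacency(edges, pages)                # Source -> Destination
cr_adj = build_adjacency(edges, pages, reverse=True)  # Destination -> Source

print(sorted(pr_adj["/"]))   # ['/a', '/b']
print(sorted(cr_adj["/"]))   # ['/a']
print(sorted(cr_adj["/b"]))  # ['/']
```

The link to `/external` is dropped from both graphs because that page is not in the crawled set.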

### Get initial weights from Moz External Equity and run PageRank and CheiRank
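This seeds the iteration with each page's Moz external equity via `nstart`, then runs `nx.pagerank` on both graphs. Running it on the reversed-link graph gives CheiRank.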

In [7]:
adr= df_html['Address'].tolist()
eqt= df_html['Equity'].tolist()

init_nstart = {v:eqt[i] for i,v in enumerate(adr)}

scores_pr = nx.pagerank(pr_graph, nstart=init_nstart, max_iter=1000)
scores_cr = nx.pagerank(cr_graph, nstart=init_nstart, max_iter=1000)
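
For intuition, here is a small pure-Python power iteration on a made-up three-page site. It is a sketch, not networkx's implementation: `pagerank` and `reverse` are hypothetical helpers, and the damping factor 0.85 is assumed. One thing it illustrates is that the starting vector affects how the iteration begins but not where it ends up. As far as I can tell from the networkx documentation, `nstart` plays the same role there, so if you want external equity to bias the final scores, the `personalization` argument of `nx.pagerank` may be closer to what you need.

```python
def pagerank(graph, start=None, alpha=0.85, iters=200):
    """Power-iteration PageRank on {node: set of successors}."""
    nodes = list(graph)
    n = len(nodes)
    total = sum(start.values()) if start else 0
    # Normalise the starting vector; fall back to uniform if it is empty.
    rank = {v: (start.get(v, 0) / total if total else 1 / n) for v in nodes}
    for _ in range(iters):
        # Rank sitting on pages without outlinks is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new[w] += alpha * rank[v] / len(graph[v])
        rank = new
    return rank

def reverse(graph):
    """Same nodes, every edge flipped."""
    flipped = {v: set() for v in graph}
    for v, successors in graph.items():
        for w in successors:
            flipped[w].add(v)
    return flipped

# Hypothetical three-page site: "/" links to both pages, "/a" links home.
links = {"/": {"/a", "/b"}, "/a": {"/"}, "/b": set()}

uniform = pagerank(links)
seeded = pagerank(links, start={"/": 0, "/a": 0, "/b": 100})
cheirank = pagerank(reverse(links))

# The starting vector changes the path, not the destination.
print(max(abs(uniform[v] - seeded[v]) for v in links) < 1e-9)  # True
print(max(uniform, key=uniform.get))    # /
print(min(cheirank, key=cheirank.get))  # /b
```

`/b` has the lowest CheiRank because it has no outgoing links, which is what CheiRank measures.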

### Plot PageRank and CheiRank Graph
**Warning**: This will more than likely run out of memory or be hard to read for large sites.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

topn = 10

if scores_pr:
    # Sort nodes by best
    ranked_nodes_pr = sorted(((score, node) for node, score in scores_pr.items()), reverse=True)
    # Get the topn nodes
    nodelist = [n[1] for n in ranked_nodes_pr][:topn]
    edgelist = [(a[0],a[1]) for a in pr_graph.edges() if a[0] in nodelist and a[1] in nodelist]
    labels = {n:n for n in nodelist}
    # Build sizes in nodelist order so each size lines up with its node
    sizes_pr = [scores_pr[x]*10000 for x in nodelist]
    sm = nx.draw(pr_graph, with_labels = True, node_size=sizes_pr, nodelist=nodelist, edgelist=edgelist, labels=labels)
    plt.show()
    
if scores_cr:
    # Sort nodes by best
    ranked_nodes_cr = sorted(((score, node) for node, score in scores_cr.items()), reverse=True)
    # Get the topn nodes
    nodelist = [n[1] for n in ranked_nodes_cr][:topn]
    edgelist = [(a[0],a[1]) for a in cr_graph.edges() if a[0] in nodelist and a[1] in nodelist]
    labels = {n:n for n in nodelist}
    # Build sizes in nodelist order so each size lines up with its node
    sizes_cr = [scores_cr[x]*10000 for x in nodelist]
    sm = nx.draw(cr_graph, with_labels = True, node_size=sizes_cr, nodelist=nodelist, edgelist=edgelist, labels=labels)
    plt.show()

### Save to a CSV
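Adds the PageRank, CheiRank, and Link Equity scores to each page, min-max normalizes the score columns to the range 0 to 1, and saves the result to a CSV file.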

In [None]:
def apply_scores(row,scores_pr,scores_cr):
    adr = row['Address']
    otl = int(row['Outlinks'] or 1)
    row['PageRank'] = scores_pr.get(adr,0)
    row['CheiRank'] = scores_cr.get(adr,0)
    row['Link Equity'] = float(scores_cr.get(adr,0)/otl)
   
    return row

def normalize_columns(df):
    cols = ['PageRank','Equity','CheiRank','Link Equity']
    
    # Min-max scale each score column to the range 0 to 1
    for c in cols:
        df[c] = (df[c]-df[c].min())/(df[c].max()-df[c].min())
        
    return df

df_html = df_html.apply(apply_scores, args=(scores_pr,scores_cr), axis=1)
df_html_norm = normalize_columns(df_html)

# Append the extension whether or not a custom filename base was given
fname = 'norm_' + (filename or domain.replace('.','_')) + ".csv"

df_html_norm.to_csv(fname)
df_html_norm.head()
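
The normalization used here is plain min-max scaling. A minimal sketch on a list of numbers is below; `min_max` is a hypothetical helper. One assumption to flag: when every value in a column is identical, the pandas expression above divides zero by zero and should produce NaN, whereas this sketch returns zeros for that case.

```python
def min_max(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: treat a constant column as all zeros instead of NaN.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([2, 4, 10]))  # [0.0, 0.25, 1.0]
print(min_max([3, 3, 3]))   # [0.0, 0.0, 0.0]
```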

### Apply Categories
This apply function can be adjusted to apply category groupings to your data however you like.

In [35]:
def apply_cat(row):
    address = row["Address"]
    row['Category'] = "None"
    
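    # Strip the scheme and host if present; addresses are already bare paths if the cleaning step was run
    address_parts = re.sub(r'^https?:\/\/(www\.)?{}/'.format(re.escape(domain)), "", address).lstrip('/').split('/')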
    
    # Adjust below to categorize how you like. 
    
    if len(address_parts) > 2:
        row['Category'] = address_parts[0]+"-"+address_parts[1]
    elif len(address_parts) > 0:
        row['Category'] = address_parts[0]
            
    return row

df_html_norm = df_html_norm.apply(apply_cat, axis=1)

# Append the extension whether or not a custom filename base was given
fname = 'cat_norm_' + (filename or domain.replace('.','_')) + ".csv"

df_html_norm.to_csv(fname)
df_html_norm.head()
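
To experiment with the grouping rules before applying them to the whole frame, you can test them on a few strings. This is a sketch of one possible variant, not the function above: `category_for` is a hypothetical helper that accepts either full URLs or bare paths, ignores empty path segments, and returns `"None"` for the homepage.

```python
import re

domain = "domain.com"

def category_for(address):
    """First one or two path segments, joined with '-'."""
    path = re.sub(r'^https?://(www\.)?{}'.format(re.escape(domain)), "", address)
    parts = [p for p in path.split('/') if p]
    if len(parts) > 2:
        return parts[0] + "-" + parts[1]
    if parts:
        return parts[0]
    return "None"

print(category_for("https://www.domain.com/blog/2019/post"))  # blog-2019
print(category_for("/blog/post"))                             # blog
print(category_for("/"))                                      # None
```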