<a href="https://colab.research.google.com/github/CharlyWargnier/GooglebotChecker/blob/master/Googlebot_Checker_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Googlebot Checker V1

This notebook verifies that the IP addresses in a .csv file genuinely belong to Googlebot, using reverse and forward DNS lookups.

It’s a Jupyter port of [searchtools.io](https://www.searchtools.io/), created by the great [Tyler Reardon](https://twitter.com/TylerReardon).

Note that the script is still in its early days, thus:

*   You may get errors when checking a large number of IP addresses
*   The reverse-DNS checks are currently limited to Googlebots
*   Bingbots and more user agents will be added soon!

[Reach out to me on Twitter](https://twitter.com/DataChaz) if you have questions. Pull requests on [GitHub](https://github.com/CharlyWargnier/GooglebotChecker) are welcome! :)

### Import Python libraries

In [0]:
import pandas as pd
import numpy as np
import socket
import json

### Upload your .csv file to Google Colab

In [58]:
#@markdown Running this cell opens an upload box so you can upload your .csv file to Colab.

#@markdown Schema requirements for your .csv are:
#@markdown *   One column containing all your IP addresses, named `IP`
#@markdown *   One column containing your user agent strings, named `userAgent`

#@markdown Here’s a [sample file](https://raw.githubusercontent.com/CharlyWargnier/GooglebotChecker/master/sampleFile.csv) you can try out.

#@markdown Wait until the file has fully uploaded before running the other cells.

#@markdown ---

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sampleFile.csv to sampleFile.csv
User uploaded file "sampleFile.csv" with length 5675092 bytes


### Convert your .csv file to a pandas DataFrame

In [0]:
df = pd.read_csv('/content/sampleFile.csv')
#df.info()
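
The later cells assume the `IP` and `userAgent` columns listed in the schema requirements, so it may be worth checking for them right after loading. A minimal sketch, assuming a hypothetical helper named `check_schema` (not part of the original notebook) and a small inline frame with made-up values standing in for the uploaded file:

```python
import pandas as pd

REQUIRED_COLUMNS = ["IP", "userAgent"]  # column names this notebook relies on


def check_schema(frame, required=REQUIRED_COLUMNS):
    """Hypothetical helper: return the required columns missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Small inline frame standing in for the uploaded .csv (illustrative values only)
sample = pd.DataFrame({
    "IP": ["192.0.2.10", "192.0.2.11"],
    "userAgent": ["Mozilla/5.0 (compatible; Googlebot/2.1)", "curl/8.0"],
})

missing = check_schema(sample)
if missing:
    raise ValueError(f"Missing required column(s): {missing}")
```

In the notebook itself, `check_schema(df)` could be called on the frame returned by `pd.read_csv`.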

### Add an `SEBotClass` column to categorise Search Engine bots

In [0]:
# na=False treats rows with a missing user agent as non-matching
df['SEBotClass'] = np.where(df.userAgent.str.contains("YandexBot", na=False), "YandexBot",
  np.where(df.userAgent.str.contains("bingbot", na=False), "BingBot",
    np.where(df.userAgent.str.contains("DuckDuckBot", na=False), "DuckDuckGo",
      np.where(df.userAgent.str.contains("Baiduspider", na=False), "Baidu",
        np.where(df.userAgent.str.contains("Googlebot/2.1", na=False), "GoogleBot", "Else")))))

#df.info()
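
The nested `np.where` calls get harder to read as more bots are added. One possible alternative is `np.select`, which takes a list of conditions and a matching list of labels. The sketch below assumes a small made-up frame named `demo` and a hypothetical `BOT_PATTERNS` list. It reuses the substrings and labels from the cell above, and like the nested version, the first matching pattern wins.

```python
import numpy as np
import pandas as pd

# Illustrative user agent strings, not taken from the sample file
demo = pd.DataFrame({"userAgent": [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    None,  # missing user agent
]})

# (substring to look for, label to assign), checked in this order
BOT_PATTERNS = [
    ("YandexBot", "YandexBot"),
    ("bingbot", "BingBot"),
    ("DuckDuckBot", "DuckDuckGo"),
    ("Baiduspider", "Baidu"),
    ("Googlebot/2.1", "GoogleBot"),
]

# regex=False matches the text literally; na=False maps missing values to False
conditions = [
    demo["userAgent"].str.contains(pattern, regex=False, na=False)
    for pattern, _ in BOT_PATTERNS
]
labels = [label for _, label in BOT_PATTERNS]

demo["SEBotClass"] = np.select(conditions, labels, default="Else")
```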

### De-dupe IP addresses


In [0]:
dfUniqueIPs = df.drop_duplicates(subset=['IP'])
#dfUniqueIPs.info()

### Filter GoogleBot UserAgents


In [0]:
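# Filter on the full table first, then de-dupe, so that an IP whose first
# logged row has a non-Googlebot user agent is not dropped by mistake
isGooglebotUA = df['userAgent'].str.contains("Googlebot", na=False)
dfUniqueIPs = df.loc[isGooglebotUA, ['IP']].drop_duplicates()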
#dfUniqueIPs.info()

### Run reverse & forward DNS checks

The script mimics [Google’s instructions](https://support.google.com/webmasters/answer/80553?hl=en), as follows:

*   First, it runs a reverse DNS lookup on the IP addresses provided.
*   Then, it verifies that the domain name is a subdomain of either googlebot.com or google.com.
*   Finally, it runs a forward DNS lookup on the hostname and verifies it matches the original IP address.
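
The three steps above can be sketched in a form that runs offline. This is a hedged illustration, not the notebook's own code: `is_google_host`, `verify_ip`, `fake_reverse` and `fake_forward` are hypothetical names, and the DNS records in the stand-in lookups are made up. Unlike the notebook cell, the sketch lower-cases the host name, strips a trailing dot, and compares the IP against every IPv4 address returned by `socket.gethostbyname_ex`.

```python
import socket

GOOGLE_SUFFIXES = ('.googlebot.com', '.google.com')


def is_google_host(host):
    """Step 2: True if `host` is a subdomain of googlebot.com or google.com."""
    # Host names are case-insensitive and may carry a trailing dot
    return host.lower().rstrip('.').endswith(GOOGLE_SUFFIXES)


def verify_ip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Hypothetical helper: run the three checks for one IPv4 address."""
    try:
        host = reverse(ip)[0]          # step 1: reverse lookup
        if not is_google_host(host):   # step 2: domain check
            return False
        # step 3: forward lookup; gethostbyname_ex returns
        # (hostname, aliaslist, ipaddrlist), so every address is compared
        return ip in forward(host)[2]
    except OSError:                    # covers socket.herror and socket.gaierror
        return False


# Stand-in lookups with made-up records, so the sketch runs without network access
def fake_reverse(ip):
    return ('crawl-192-0-2-1.googlebot.com', [], [ip])


def fake_forward(host):
    return (host, [], ['192.0.2.1'])


print(verify_ip('192.0.2.1', fake_reverse, fake_forward))
print(verify_ip('192.0.2.9', fake_reverse, fake_forward))
```

The leading dot in each suffix matters: a host such as `fakegooglebot.com` ends with `googlebot.com` but not with `.googlebot.com`, so it is rejected.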



In [0]:
#@markdown Running this cell defines the helper functions and runs the check on every unique IP (double-click to reveal the code).

#@markdown ---

def reverse_dns(ip_address):
    '''
    This method returns the true host name for a 
    given IP address
    '''
    host_name = socket.gethostbyaddr(ip_address)
    reversed_dns = host_name[0]
    return reversed_dns

def forward_dns(reversed_dns):
    '''
    This method returns the first IPv4 address that the given
    host name resolves to, or False if the lookup fails
    '''
    try:
        return socket.gethostbyname(reversed_dns)
    except Exception:
        print('Forward DNS lookup failed for', reversed_dns)
        return False

def ip_match(ip, true_ip):
    '''
    This method takes an IP address used for a reverse DNS lookup
    and an IP address returned from a forward DNS lookup
    and determines if they match.
    '''
    return ip == true_ip

def confirm_googlebot(host, ip_match):
    '''
    This method takes a hostname and the result of the ip_match() method
    and determines if an IP address from a log file is truly Googlebot
    '''
    if not host:
        return False
    # str.endswith accepts a tuple of suffixes
    is_google_host = host.endswith(('.googlebot.com', '.google.com'))
    return bool(is_google_host and ip_match)
                
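def run(ip):
    try:
        host = reverse_dns(ip)
        true_ip = forward_dns(host)
        is_match = ip_match(ip, true_ip)
        return confirm_googlebot(host, is_match)
    except Exception:
        # A failed lookup means the IP cannot be verified. Catching Exception
        # instead of using a bare except keeps kernel interrupts working.
        return False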

# Run the check on every unique IP address
dfUniqueIPs['isRealGbot'] = dfUniqueIPs['IP'].apply(run)
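
With `apply(run)`, each lookup blocks until the previous one has finished, which can be slow for long IP lists. One possible way to overlap the waiting time is a thread pool. The sketch below assumes a hypothetical helper named `check_many` and a stand-in checker, `fake_check`, so that it runs without network access. The `max_workers` value is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def check_many(ips, check, max_workers=20):
    """Hypothetical helper: apply `check` to each distinct IP using a thread pool."""
    unique_ips = list(dict.fromkeys(ips))  # de-dupe while keeping first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as unique_ips
        return dict(zip(unique_ips, pool.map(check, unique_ips)))


# Stand-in checker with made-up logic, used only so the example runs offline
def fake_check(ip):
    return ip.startswith("66.249.")


verdicts = check_many(["66.249.66.1", "203.0.113.7", "66.249.66.1"], fake_check)
print(verdicts)
```

In the notebook, `run` would take the place of `fake_check`, and the resulting dictionary could be applied with `dfUniqueIPs['IP'].map(...)`.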


### Merge results back into the Master table `df`


In [0]:
df = df.merge(dfUniqueIPs, how='left', on='IP')
df['isRealGbot'] = df['isRealGbot'].fillna(value="n/a - other User Agents")
# Summary of the output table (comment out if not needed)
df.info()
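
A left merge keeps one output row per log row only if each IP appears once in the right-hand table. As a hedged illustration on made-up data, the `validate` parameter of `DataFrame.merge` can assert that assumption. The frames `log` and `checked` below are hypothetical stand-ins for `df` and `dfUniqueIPs`.

```python
import pandas as pd

# Illustrative data: three log rows, two distinct IPs, one checked IP
log = pd.DataFrame({
    "IP": ["192.0.2.1", "192.0.2.1", "192.0.2.2"],
    "userAgent": ["Googlebot/2.1", "Googlebot/2.1", "curl/8.0"],
})
checked = pd.DataFrame({"IP": ["192.0.2.1"], "isRealGbot": [False]})

# validate="many_to_one" raises pandas.errors.MergeError if `checked`
# contains the same IP twice, which would otherwise duplicate log rows
merged = log.merge(checked, how="left", on="IP", validate="many_to_one")

# IPs that were never checked come back as missing values
print(merged["isRealGbot"].isna().tolist())
```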

### Check results before exporting to .csv


In [0]:
df.groupby('isRealGbot').count()['IP']

### Export to .csv 

Export the DataFrame to a .csv file, naming it via the form below.



In [0]:
csvName = "OutputCSV" #@param {type:"string"}
csvName = csvName + '.csv'
df.to_csv(csvName)

Voilà!