# Scan date analysis

## Context

See GitHub issue: [https://github.com/GSA/site-scanning/issues/201](https://github.com/GSA/site-scanning/issues/201)

## Dependencies

In [7]:
%load_ext autoreload
%autoreload 2

%pip install --quiet pandas jinja2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas as pd

from datetime import timedelta


## Load data

Read the most recent weekly snapshot and the target URL list into pandas DataFrames.

In [9]:
snapshot_df = pd.read_json("https://api.gsa.gov/technology/site-scanning/data/weekly-snapshot.json")
snapshot_df['scan_date'] = pd.to_datetime(snapshot_df['scan_date'])

In [10]:
target_url_df = pd.read_csv('https://raw.githubusercontent.com/GSA/federal-website-index/main/data/site-scanning-target-url-list.csv')

## Get "old" scans

In [11]:
# Treat any scan more than one day older than the newest scan as "old".
last_scan_date = snapshot_df['scan_date'].max()
fresh_scan_date = last_scan_date - timedelta(days=1)
older_scans = (
    snapshot_df.query('scan_date < @fresh_scan_date')[['scan_date', 'target_url', 'final_url']]
    .sort_values('scan_date')  # oldest scans first
)
older_scans[:10].style.set_caption(f'Ten oldest scans before {fresh_scan_date}')

|  | scan_date | target_url | final_url |
| --- | --- | --- | --- |
| 23823 | 2022-05-16 01:03:52.007000+00:00 | livehelp.cancer.gov | https://livehelp.cancer.gov/app/chat/chat_launch |
| 21543 | 2022-05-16 01:04:50.460000+00:00 | livehelp-es.cancer.gov | https://livehelp-es.cancer.gov/app/chat/chat_launch |
| 2195 | 2022-05-16 01:11:01.475000+00:00 | takebackday.dea.gov |  |
| 19446 | 2022-05-16 01:16:21.647000+00:00 | www-nrd.nhtsa.dot.gov | https://www.nhtsa.gov/research |
| 6574 | 2022-05-16 01:16:29.597000+00:00 | edlabs.ed.gov | https://ies.ed.gov/ncee/rel/ |
| 5335 | 2022-05-16 01:27:20.667000+00:00 | tracker.cloud.hhs.gov | https://ams.hhs.gov/amsLogin/SimpleLogin.jsp |
| 21108 | 2022-05-16 02:29:03.696000+00:00 | wfmi.nifc.gov | https://wfmi.nifc.gov/cgi/WfmiHome.cgi |
| 9918 | 2022-05-16 02:35:53.448000+00:00 | trials.nci.nih.gov | https://trials.nci.nih.gov/login/ |
| 11565 | 2022-05-16 02:36:05.134000+00:00 | trials-stage.nci.nih.gov | https://trials-stage.nci.nih.gov/login/ |
| 17350 | 2022-05-16 03:01:42.336000+00:00 | epsi.pppl.gov |  |
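
The one-day cutoff used above can be sanity-checked on a tiny synthetic frame. The sketch below is only an illustration: the `example.gov` hostnames, the timestamps, and the `stale_scans` helper are hypothetical and not part of the notebook. It uses a boolean mask in place of `query`, which should select the same rows.

```python
from datetime import timedelta

import pandas as pd

# Synthetic stand-in for the snapshot; hostnames and timestamps are made up.
toy_df = pd.DataFrame({
    "target_url": ["a.example.gov", "b.example.gov", "c.example.gov"],
    "scan_date": pd.to_datetime([
        "2022-05-16T01:00:00Z",
        "2022-05-20T23:00:00Z",
        "2022-05-21T12:00:00Z",
    ]),
})


def stale_scans(df, max_age=timedelta(days=1)):
    """Return rows scanned more than `max_age` before the newest scan, oldest first."""
    cutoff = df["scan_date"].max() - max_age
    return df[df["scan_date"] < cutoff].sort_values("scan_date")


stale = stale_scans(toy_df)
print(stale["target_url"].tolist())
```

Only the first row falls before the cutoff here, because the second scan is within one day of the newest one.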


In [12]:
print('Domains not in the target URL list:')
set(older_scans['target_url']) - set(target_url_df['website'])

Domains not in the target URL list:


{'stgpdas.samhsa.gov'}
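
The set difference above yields only the hostnames. If the full scan rows are wanted (for example, to see the `scan_date` of each unlisted domain), one option is an `isin` filter. The following is a minimal sketch on made-up data; only the `target_url` and `website` column names come from the notebook, and the values are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for older_scans and target_url_df; values are made up.
scans = pd.DataFrame({
    "target_url": ["a.example.gov", "b.example.gov", "c.example.gov"],
    "final_url": ["https://a.example.gov/", None, "https://c.example.gov/"],
})
targets = pd.DataFrame({"website": ["a.example.gov", "b.example.gov"]})

# Keep the scan rows whose target_url does not appear in the target list.
missing_rows = scans[~scans["target_url"].isin(targets["website"])]
print(missing_rows)
```

This keeps every column of the unmatched rows, so they can be inspected or exported directly.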