# Temporal OSINT Analysis with Wayback Machine

## Overview

This notebook demonstrates various techniques for conducting **Temporal Open Source Intelligence (OSINT)** analysis using the Wayback Machine. Temporal OSINT involves analyzing how websites, content, and digital footprints change over time to gather intelligence and track historical information.

## Key Concepts

- **Wayback Machine**: Internet Archive's digital time capsule that stores snapshots of websites over time
- **Temporal Analysis**: Examining changes in content, structure, and metadata across different time periods
- **OSINT Applications**: Intelligence gathering, digital forensics, brand monitoring, and research
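
Each Wayback Machine capture is addressed by a 14-digit `YYYYMMDDhhmmss` timestamp (in UTC) embedded in its archive URL, as in the snapshot URLs used later in this notebook. As a minimal sketch, the helpers below convert between that timestamp and Python objects. The names `parse_wayback_timestamp` and `build_archive_url` are illustrative and are not part of waybackpy.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"

def parse_wayback_timestamp(ts):
    """Parse a 14-digit Wayback timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def build_archive_url(ts, original_url):
    """Build the archive URL for a capture of original_url at timestamp ts."""
    return f"{WAYBACK_PREFIX}/{ts}/{original_url}"

capture = parse_wayback_timestamp("19961114070121")
print(capture.isoformat())
print(build_archive_url("19961114070121", "http://www.rte.ie:80/"))
```

Working with `datetime` objects rather than raw strings makes the later range filtering and gap calculations less error-prone.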

## Tools Used

- **waybackpy**: Python library for accessing Wayback Machine programmatically
- **requests**: HTTP library for fetching web content
- **difflib**: Library for comparing text differences between snapshots

## Target Website

We'll be using **RTÉ (Raidió Teilifís Éireann)** - Ireland's national public service broadcaster - as our example target:
- **URL**: https://www.rte.ie
- **Why RTÉ**: Well-established news organization with frequent content updates, making it ideal for temporal analysis

---

## Environment Setup and Prerequisites

Before running the examples in this notebook, you need to set up the environment and install the required packages. This section provides both virtual environment setup and direct installation methods.

### Option 1: Using Virtual Environment (Recommended)

Setting up a virtual environment ensures that dependencies don't conflict with your system Python packages.

In [1]:
# Virtual Environment Setup
# This cell only prints instructions. Run the printed commands in your
# terminal/command prompt, not in the notebook.

import sys
import os

def setup_virtual_environment():
    """
    Print step-by-step instructions for setting up a virtual environment.
    This function does not create the environment itself.
    """
    print("Virtual Environment Setup Instructions:")
    print("=" * 50)
    print("1. Open Terminal/Command Prompt")
    print("2. Navigate to your project directory")
    print("3. Run the following commands:")
    print()
    print("   # Create virtual environment")
    print("   python -m venv temporal-osint-env")
    print()
    print("   # Activate virtual environment")
    print("   # On Windows:")
    print("   temporal-osint-env\\Scripts\\activate")
    print("   # On macOS/Linux:")
    print("   source temporal-osint-env/bin/activate")
    print()
    print("   # Install required packages")
    print("   pip install --upgrade pip")
    print("   pip install waybackpy requests beautifulsoup4 pandas matplotlib lxml")
    print()
    print("4. After activation, you can run Jupyter:")
    print("   pip install jupyter")
    print("   jupyter notebook")
    print()
    print("Current Python executable:", sys.executable)
    print("Current working directory:", os.getcwd())

# Run the setup instructions
setup_virtual_environment()

Virtual Environment Setup Instructions:
1. Open Terminal/Command Prompt
2. Navigate to your project directory
3. Run the following commands:

   # Create virtual environment
   python -m venv temporal-osint-env

   # Activate virtual environment
   # On Windows:
   temporal-osint-env\Scripts\activate
   # On macOS/Linux:
   source temporal-osint-env/bin/activate

   # Install required packages
   pip install --upgrade pip
   pip install waybackpy requests beautifulsoup4 pandas matplotlib lxml

4. After activation, you can run Jupyter:
   pip install jupyter
   jupyter notebook

Current Python executable: /usr/bin/python3
Current working directory: /content
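
The output above reports `/usr/bin/python3`, which suggests the kernel was using the system interpreter rather than a virtual environment. A small check, sketched here with only the standard library (`in_virtual_environment` is a hypothetical helper name), can confirm which case applies before installing anything:

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Running inside a virtual environment:", in_virtual_environment())
```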


In [None]:
# Option 2: Direct Installation in Notebook Environment
# Install required packages directly in the current notebook environment

import subprocess
import sys
import importlib.util

def install_package(package_name):
    """Install a package using pip."""
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        print(f"✓ Successfully installed {package_name}")
    except subprocess.CalledProcessError:
        print(f"✗ Failed to install {package_name}")

def check_package_installed(package_name):
    """Check if a package is already installed."""
    spec = importlib.util.find_spec(package_name)
    return spec is not None

# List of required packages
required_packages = [
    "waybackpy",
    "requests",
    "beautifulsoup4",
    "pandas",
    "matplotlib",
    "lxml"
]

print("Checking and installing required packages...")
print("=" * 50)

for package in required_packages:
    # Special handling for beautifulsoup4 (imported as bs4)
    import_name = "bs4" if package == "beautifulsoup4" else package

    if check_package_installed(import_name):
        print(f"✓ {package} is already installed")
    else:
        print(f"Installing {package}...")
        install_package(package)

print("\nInstallation complete!")
print("You can now run the temporal OSINT examples below.")

In [None]:
# Verify Package Installation and Versions
# This cell verifies that all required packages are properly installed

import sys
print("Python Environment Information:")
print("=" * 40)
print(f"Python Version: {sys.version}")
print(f"Python Executable: {sys.executable}")
print()

print("Package Versions:")
print("=" * 40)

# Check each required package
packages_to_check = [
    ('waybackpy', 'waybackpy'),
    ('requests', 'requests'),
    ('beautifulsoup4', 'bs4'),
    ('pandas', 'pandas'),
    ('matplotlib', 'matplotlib'),
    ('lxml', 'lxml')
]

all_packages_available = True

for package_name, import_name in packages_to_check:
    try:
        module = __import__(import_name)
        version = getattr(module, '__version__', 'Unknown')
        print(f"✓ {package_name}: {version}")
    except ImportError:
        print(f"✗ {package_name}: NOT INSTALLED")
        all_packages_available = False

print()
if all_packages_available:
    print("🎉 All required packages are installed and ready!")
    print("You can now proceed with the temporal OSINT examples.")
else:
    print("⚠️  Some packages are missing. Please install them before continuing.")
    print("Run the installation cell above or use pip install in your terminal.")
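
Some packages do not expose a `__version__` attribute, in which case the cell above reports `Unknown`. One alternative, sketched here with the standard library's `importlib.metadata` (Python 3.8+), is to look the version up by distribution name, that is, the name used with `pip install`. `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None

for name in ["waybackpy", "requests", "beautifulsoup4", "pandas", "matplotlib", "lxml"]:
    found = installed_version(name)
    print(f"{name}: {found if found else 'NOT INSTALLED'}")
```

Because the lookup uses the distribution name, no special mapping from `beautifulsoup4` to `bs4` is needed.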

---

# Temporal OSINT Examples

The following examples demonstrate various techniques for conducting temporal OSINT analysis using the Wayback Machine. Each example is self-contained and focuses on a specific aspect of temporal analysis.

**Important:** Make sure to run the package installation and verification cells above before proceeding with these examples.

---

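## Example 1: Basic Snapshot Discovery

This example demonstrates how to discover the available snapshots of a website and display basic information about each one.

In [None]: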
# ============================================================================
# EXAMPLE 1: Basic Snapshot Discovery
# ============================================================================
# This example demonstrates how to discover all available snapshots of a website
# and display basic information about each snapshot.

from waybackpy import WaybackMachineCDXServerAPI

# Target configuration
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"

print("EXAMPLE 1: Basic Snapshot Discovery")
print("=" * 50)

# Initialize the CDX Server API
wayback = WaybackMachineCDXServerAPI(url, user_agent)

print(f"Discovering snapshots for: {url}")
print("=" * 50)

# Limit output to first 10 snapshots for readability
snapshot_count = 0
for snapshot in wayback.snapshots():
    print(f"Snapshot #{snapshot_count + 1}:")
    print(f"  Timestamp: {snapshot.timestamp}")
    print(f"  Archive URL: {snapshot.archive_url}")
    print(f"  Status Code: {snapshot.statuscode}")
    print(f"  MIME Type: {snapshot.mimetype}")
    print("-" * 30)

    snapshot_count += 1
    if snapshot_count >= 10:
        print("... (showing first 10 snapshots)")
        break

print(f"\nExample 1 completed: Found {snapshot_count} snapshots (limited display)")
print("=" * 50)

19961114070121 https://web.archive.org/web/19961114070121/http://www.rte.ie:80/
19961228161552 https://web.archive.org/web/19961228161552/http://www.rte.ie:80/
19970121091307 https://web.archive.org/web/19970121091307/http://www.rte.ie:80/
19970310134836 https://web.archive.org/web/19970310134836/http://www.rte.ie:80/
19970310134836 https://web.archive.org/web/19970310134836/http://www.rte.ie:80/
19970620085502 https://web.archive.org/web/19970620085502/http://www.rte.ie:80/
19981202165724 https://web.archive.org/web/19981202165724/http://www.rte.ie:80/
19990125091746 https://web.archive.org/web/19990125091746/http://rte.ie:80/
19990208010051 https://web.archive.org/web/19990208010051/http://rte.ie:80/
19990221130658 https://web.archive.org/web/19990221130658/http://www.rte.ie:80/
19990223193301 https://web.archive.org/web/19990223193301/http://www.rte.ie:80/
19990223224534 https://web.archive.org/web/19990223224534/http://www.rte.ie:80/
19990224022746 https://web.archive.org/web/19990

KeyboardInterrupt: 
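
The truncated listing and `KeyboardInterrupt` above are what an unbounded loop over a long-archived site tends to produce. One way to bound the client-side work is `itertools.islice`, sketched below against a hypothetical stand-in generator so that it runs without network access. `take` and `fake_snapshot_timestamps` are illustrative names.

```python
from itertools import count, islice

def take(iterable, n):
    """Return at most n items without consuming the rest of the iterator."""
    return list(islice(iterable, n))

def fake_snapshot_timestamps():
    """Hypothetical stand-in for wayback.snapshots(): an endless stream of timestamps."""
    for year in count(1996):
        yield f"{year}0101000000"

print(take(fake_snapshot_timestamps(), 3))
# With waybackpy the equivalent call would be: take(wayback.snapshots(), 10)
```

Note that this only limits how many results are consumed. The underlying CDX response may still be downloaded in full, so a server-side cap is the more reliable way to keep requests small. Recent waybackpy versions appear to accept a `limit` argument on `WaybackMachineCDXServerAPI` for this purpose, which is worth verifying against the installed version.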

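## Example 2: Website Content Comparison Between Time Periods

This example compares two specific snapshots of RTÉ to identify changes and demonstrates diff analysis between historical versions.

In [None]: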
# ============================================================================
# EXAMPLE 2: Website Content Comparison Between Time Periods
# ============================================================================
# This example compares two specific snapshots of RTÉ to identify changes
# and demonstrates diff analysis between historical versions.

import requests
from difflib import unified_diff

print("EXAMPLE 2: Website Content Comparison")
print("=" * 50)

# Two specific snapshots of RTÉ from 2023
old_url = "https://web.archive.org/web/20230519081540/https://www.rte.ie"
new_url = "https://web.archive.org/web/20230605155810/https://www.rte.ie"

print("Fetching historical snapshots...")
print(f"Old snapshot: {old_url}")
print(f"New snapshot: {new_url}")
print("=" * 70)

try:
    # Fetch the content from both snapshots
    print("Downloading old snapshot...")
    old_page = requests.get(old_url, timeout=30).text.splitlines()

    print("Downloading new snapshot...")
    new_page = requests.get(new_url, timeout=30).text.splitlines()

    # Generate unified diff
    diff = unified_diff(
        old_page,
        new_page,
        fromfile='RTÉ - May 19, 2023',
        tofile='RTÉ - June 5, 2023',
        lineterm=''
    )

    # Display differences (limit output for readability)
    diff_lines = list(diff)
    if diff_lines:
        print(f"Found {len(diff_lines)} diff lines between snapshots:")
        print("=" * 50)
        print("\n".join(diff_lines[:50]))  # Show first 50 lines
        if len(diff_lines) > 50:
            print(f"\n... ({len(diff_lines) - 50} more lines of differences)")
    else:
        print("No differences found between the two snapshots.")

    print(f"\nExample 2 completed: Analysis of {len(diff_lines)} difference lines")
    print("=" * 50)

except Exception as e:
    print(f"Error fetching snapshots: {e}")
    print("This might be due to network issues or snapshot availability.")
    print("=" * 50)

--- 2020

+++ 2022

@@ -31,7 +31,7 @@

       }
     </script>
     <!-- is_embedded: False; context:  -->
-    <!-- navbar https://archive.org/web/navbar.php 0.05193s -->
+    <!-- navbar https://archive.org/web/navbar.php 0.08036s -->
     <!-- navbar script -->
     <script src="//archive.org/includes/athena.js?v=64456e44" type="text/javascript"></script>
     <script src="//archive.org/includes/apollo.js?v=64456e44" type="text/javascript"></script>
@@ -45,7 +45,7 @@

     <meta property="mediatype" content="">
     <meta property="primary_collection" content="">
     <!-- navbar end -->
-    <script type="text/javascript">if('archive_analytics' in window){var v=archive_analytics.values;v.path='/web';v.service='wb';v.server_name='wwwb-app225.us.archive.org';v.server_ms=1695;archive_analytics.send_pageview_on_load()}</script>
+    <script type="text/javascript">if('archive_analytics' in window){var v=archive_analytics.values;v.path='/web';v.service='wb';v.server_name='wwwb-app201.us.
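
The diff output above is dominated by the Wayback Machine's own navbar and analytics markup rather than by changes to the RTÉ page. The Wayback Machine supports an `id_` modifier after the timestamp that returns the capture as originally served, without the injected toolbar or rewritten links. The sketch below shows how such a URL could be derived and how a diff can be reduced to its changed lines. `raw_snapshot_url` and `changed_lines` are illustrative helper names, and the two tiny pages are made-up stand-ins, not real RTÉ content.

```python
import re
from difflib import unified_diff

def raw_snapshot_url(archive_url):
    """Insert the id_ modifier after the 14-digit timestamp of an archive URL."""
    return re.sub(r"(/web/\d{14})/", r"\1id_/", archive_url, count=1)

def changed_lines(old_lines, new_lines):
    """Return only added/removed lines of a unified diff (no headers or context)."""
    diff = list(unified_diff(old_lines, new_lines, lineterm=""))
    body = diff[2:]  # skip the two '---' / '+++' file header lines
    return [line for line in body if line.startswith(("+", "-"))]

print(raw_snapshot_url("https://web.archive.org/web/20230519081540/https://www.rte.ie"))

# Made-up in-memory stand-ins for two downloaded pages
old_page = ["<h1>News</h1>", "<p>Story A</p>"]
new_page = ["<h1>News</h1>", "<p>Story B</p>"]
print(changed_lines(old_page, new_page))
```

Fetching both snapshots through their `id_` URLs before diffing should remove most of the archive-generated noise, leaving differences that reflect the site itself.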

## Example 3: Temporal Analysis with Date Range Filtering

This example demonstrates how to analyze snapshots within a specific date range to track changes during particular time periods (e.g., during major news events, elections, or crises).

In [None]:
# ============================================================================
# EXAMPLE 3: Date Range Filtering for Temporal Analysis
# ============================================================================
# Analyze RTÉ snapshots during specific time periods (e.g., during major events)

from waybackpy import WaybackMachineCDXServerAPI
from datetime import datetime, timedelta
import pandas as pd

print("EXAMPLE 3: Date Range Filtering")
print("=" * 50)

# Initialize wayback API
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"
wayback = WaybackMachineCDXServerAPI(url, user_agent)

# Define analysis period (e.g., during a major event)
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 6, 30, 23, 59, 59)  # inclusive: covers all of June 30

print(f"Analyzing RTÉ snapshots from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
print("=" * 70)

# Collect snapshots in date range
snapshots_data = []
total_snapshots_checked = 0

for snapshot in wayback.snapshots():
    total_snapshots_checked += 1

    # Parse timestamp
    try:
        snapshot_date = datetime.strptime(snapshot.timestamp, '%Y%m%d%H%M%S')
    except ValueError:
        continue

    # Filter by date range
    if start_date <= snapshot_date <= end_date:
        snapshots_data.append({
            'timestamp': snapshot.timestamp,
            'date': snapshot_date.strftime('%Y-%m-%d'),
            'time': snapshot_date.strftime('%H:%M:%S'),
            'archive_url': snapshot.archive_url,
            'status_code': snapshot.statuscode
        })

# Create DataFrame for analysis
df = pd.DataFrame(snapshots_data)

if not df.empty:
    print(f"Found {len(df)} snapshots in the specified date range")
    print(f"(Checked {total_snapshots_checked} total snapshots)")

    print("\nSnapshot frequency by month:")
    df['month'] = pd.to_datetime(df['date']).dt.to_period('M')
    monthly_counts = df.groupby('month').size()
    print(monthly_counts)

    print("\nFirst 10 snapshots in range:")
    print(df.head(10)[['date', 'time', 'status_code']])

    print(f"\nExample 3 completed: Analyzed {len(df)} snapshots in date range")
    print("=" * 50)
else:
    print("No snapshots found in the specified date range")
    print("=" * 50)
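
The loop above downloads the full snapshot index and filters it on the client, which is slow for a site archived since 1996. The Wayback CDX server accepts `from` and `to` query parameters, so the date range can be applied on the server instead. The sketch below only builds the query URL and makes no request. `build_cdx_query` is an illustrative helper name.

```python
from datetime import datetime
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, start, end, limit=None):
    """Build a CDX search URL restricted to captures between start and end."""
    params = {
        "url": url,
        "from": start.strftime("%Y%m%d%H%M%S"),
        "to": end.strftime("%Y%m%d%H%M%S"),
        "output": "json",
    }
    if limit is not None:
        params["limit"] = limit  # optional cap on the number of rows returned
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = build_cdx_query("www.rte.ie", datetime(2023, 1, 1), datetime(2023, 6, 30, 23, 59, 59))
print(query)
```

waybackpy's `WaybackMachineCDXServerAPI` appears to expose the same filters through `start_timestamp` and `end_timestamp` constructor arguments. Treat those names as an assumption to confirm against the version you have installed.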

## Example 4: Metadata Analysis and Change Detection

This example analyzes how snapshot metadata (HTTP status codes, MIME types, content digests, and response sizes) changes over time, in order to identify technical changes, outages, or structural modifications to the website.

In [None]:
# ============================================================================
# EXAMPLE 4: Metadata Analysis and Change Detection
# ============================================================================
# Analyze technical changes and anomalies in website snapshots

from waybackpy import WaybackMachineCDXServerAPI
from collections import Counter
import pandas as pd

print("EXAMPLE 4: Metadata Analysis and Change Detection")
print("=" * 50)

# Initialize wayback API
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"
wayback = WaybackMachineCDXServerAPI(url, user_agent)

print("Analyzing RTÉ metadata changes over time...")
print("=" * 50)

# Collect metadata from snapshots
metadata_list = []
snapshot_count = 0

for snapshot in wayback.snapshots():
    metadata_list.append({
        'timestamp': snapshot.timestamp,
        'status_code': snapshot.statuscode,
        'mime_type': snapshot.mimetype,
        'digest': snapshot.digest,
        'length': getattr(snapshot, 'length', 'N/A')
    })

    snapshot_count += 1
    if snapshot_count >= 100:  # Limit for analysis
        break

# Create DataFrame
df = pd.DataFrame(metadata_list)

# Analyze status codes
print("Status Code Distribution:")
print("-" * 30)
status_counts = df['status_code'].value_counts()
print(status_counts)

# Analyze MIME types
print("\nMIME Type Distribution:")
print("-" * 30)
mime_counts = df['mime_type'].value_counts()
print(mime_counts)

# Detect anomalies (non-200 status codes)
anomalies = df[df['status_code'] != '200']
if not anomalies.empty:
    print(f"\nDetected {len(anomalies)} anomalies (non-200 status codes):")
    print("-" * 30)
    print(anomalies[['timestamp', 'status_code', 'mime_type']].head(10))
else:
    print("\nNo anomalies detected - all snapshots returned status 200")

# Analyze unique content digests (indicates content changes)
unique_digests = df['digest'].nunique()
total_snapshots = len(df)
change_ratio = (unique_digests / total_snapshots) * 100

print(f"\nContent Change Analysis:")
print("-" * 30)
print(f"Total snapshots analyzed: {total_snapshots}")
print(f"Unique content digests: {unique_digests}")
print(f"Content change ratio: {change_ratio:.1f}%")
print(f"Interpretation: {'High content volatility' if change_ratio > 80 else 'Moderate content changes' if change_ratio > 50 else 'Low content volatility'}")

print(f"\nExample 4 completed: Analyzed {total_snapshots} snapshots")
print("=" * 50)
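
The ratio of unique digests says how many distinct versions exist, but not when the content changed. A complementary view is to walk the snapshots in order and record each point where the digest differs from the previous capture. The sketch below uses only the standard library. `change_points` is an illustrative name, and the digests are made-up placeholder values.

```python
def change_points(records):
    """Return timestamps at which the digest differs from the previous capture.

    records: iterable of (timestamp, digest) pairs in chronological order.
    The first capture is always reported, since it has no predecessor.
    """
    changes = []
    previous = None
    for timestamp, digest in records:
        if digest != previous:
            changes.append(timestamp)
        previous = digest
    return changes

# Made-up digests for illustration only
records = [
    ("20230101000000", "AAA"),
    ("20230102000000", "AAA"),  # unchanged
    ("20230103000000", "BBB"),  # changed
    ("20230104000000", "AAA"),  # reverted: same digest as before, but still a change
]
print(change_points(records))
```

In this example there are only two unique digests but three change points, because a page that reverts to earlier content counts as a change in time even though it adds no new digest.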

## Example 5: Near-Time vs. Historical Comparison

This example demonstrates how to compare a website's current state with historical snapshots to identify recent changes, which is particularly useful for monitoring and intelligence purposes.

In [None]:
# ============================================================================
# EXAMPLE 5: Near-Time vs. Historical Comparison
# ============================================================================
# Compare current website state with historical snapshots

from waybackpy import WaybackMachineCDXServerAPI
import requests
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
import re

print("EXAMPLE 5: Near-Time vs. Historical Comparison")
print("=" * 50)

# Initialize APIs
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"
wayback = WaybackMachineCDXServerAPI(url, user_agent)

print("Comparing current RTÉ website with historical snapshots...")
print("=" * 60)

# Get the most recent snapshot
try:
    latest_snapshot = wayback.newest()
    print(f"Most recent snapshot: {latest_snapshot.timestamp}")
    print(f"Archive URL: {latest_snapshot.archive_url}")

    # Get current website content (send the same descriptive User-Agent used for the API)
    print("\nFetching current website content...")
    headers = {'User-Agent': user_agent}
    current_response = requests.get(url, headers=headers, timeout=30)
    current_soup = BeautifulSoup(current_response.text, 'html.parser')

    # Get historical snapshot content
    print("Fetching historical snapshot content...")
    historical_response = requests.get(latest_snapshot.archive_url, headers=headers, timeout=30)
    historical_soup = BeautifulSoup(historical_response.text, 'html.parser')

    # Compare key elements
    print("\nComparison Results:")
    print("-" * 30)

    # Compare page titles
    current_title = current_soup.title.string if current_soup.title else "No title"
    historical_title = historical_soup.title.string if historical_soup.title else "No title"

    print(f"Current title: {current_title}")
    print(f"Historical title: {historical_title}")
    print(f"Title changed: {'Yes' if current_title != historical_title else 'No'}")

    # Compare number of links
    current_links = len(current_soup.find_all('a', href=True))
    historical_links = len(historical_soup.find_all('a', href=True))

    print(f"\nCurrent links count: {current_links}")
    print(f"Historical links count: {historical_links}")
    print(f"Link count difference: {current_links - historical_links}")

    # Compare meta descriptions (.get avoids a KeyError when the tag has no content attribute)
    current_meta = current_soup.find('meta', attrs={'name': 'description'})
    historical_meta = historical_soup.find('meta', attrs={'name': 'description'})

    current_desc = current_meta.get('content', "No description") if current_meta else "No description"
    historical_desc = historical_meta.get('content', "No description") if historical_meta else "No description"

    print(f"\nMeta description changed: {'Yes' if current_desc != historical_desc else 'No'}")

    # Analyze content size difference
    current_size = len(current_response.text)
    historical_size = len(historical_response.text)
    size_diff = current_size - historical_size
    size_change_percent = (size_diff / historical_size) * 100 if historical_size > 0 else 0

    print(f"\nContent size analysis:")
    print(f"Current size: {current_size:,} characters")
    print(f"Historical size: {historical_size:,} characters")
    print(f"Size difference: {size_diff:,} characters ({size_change_percent:.1f}%)")

    print(f"\nExample 5 completed: Compared current vs historical snapshot")
    print("=" * 50)

except Exception as e:
    print(f"Error during comparison: {e}")
    print("This might be due to website accessibility or rate limiting")
    print("=" * 50)
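
Link counts and page sizes from an archived page are not directly comparable with the live page, because the Wayback Machine rewrites links to point back into the archive and adds its own markup. One way to compare like with like is to strip the archive prefix from each `href` before building link sets. The sketch below assumes rewritten links take the form `/web/<timestamp>/<url>` or `https://web.archive.org/web/<timestamp>/<url>`. `strip_wayback_prefix` is an illustrative name, and the link lists are made-up examples.

```python
import re

# Matches an optional scheme/host, then /web/<14 digits><optional modifier>/
_WAYBACK_PREFIX = re.compile(
    r"^(?:(?:https?:)?//web\.archive\.org)?/web/\d{14}[a-z_]*/"
)

def strip_wayback_prefix(href):
    """Return the original URL for an archive-rewritten href (unchanged otherwise)."""
    return _WAYBACK_PREFIX.sub("", href, count=1)

# Made-up link lists for illustration
live_links = {
    "https://www.rte.ie/news/",
    "https://www.rte.ie/sport/",
    "https://www.rte.ie/player/",
}
archived_hrefs = [
    "/web/20230605155810/https://www.rte.ie/news/",
    "https://web.archive.org/web/20230605155810/https://www.rte.ie/sport/",
]
archived_links = {strip_wayback_prefix(h) for h in archived_hrefs}

print("Added since snapshot:", sorted(live_links - archived_links))
print("Removed since snapshot:", sorted(archived_links - live_links))
```

Comparing sets of normalized links shows which links were added or removed, which is more informative than a difference in raw counts.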

## Example 6: Keyword and Content Tracking

This example demonstrates how to track specific keywords, topics, or content patterns across time to monitor coverage of particular subjects, detect censorship, or analyze editorial changes.

In [None]:
# ============================================================================
# EXAMPLE 6: Keyword and Content Tracking
# ============================================================================
# Track specific keywords or topics across time periods

from waybackpy import WaybackMachineCDXServerAPI
import requests
import re
from datetime import datetime
from bs4 import BeautifulSoup
import pandas as pd

print("EXAMPLE 6: Keyword and Content Tracking")
print("=" * 50)

# Initialize wayback API
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"
wayback = WaybackMachineCDXServerAPI(url, user_agent)

# Keywords to track (relevant to Irish/RTÉ context)
keywords = ['brexit', 'covid', 'ukraine', 'election', 'housing', 'climate']

print("Tracking keyword frequency across RTÉ snapshots...")
print(f"Target keywords: {', '.join(keywords)}")
print("=" * 60)

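# Process the first 20 successful snapshots returned by the API.
# Results arrive oldest first, so recent keywords may not appear in this sample.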
snapshot_data = []
processed_snapshots = 0

for snapshot in wayback.snapshots():
    if processed_snapshots >= 20:  # Limit for demo purposes
        break

    try:
        # Skip if not a successful snapshot
        if snapshot.statuscode != '200':
            continue

        print(f"Processing snapshot {processed_snapshots + 1}/20: {snapshot.timestamp}")

        # Fetch snapshot content
        response = requests.get(snapshot.archive_url, timeout=15)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract text content
        text_content = soup.get_text().lower()

        # Count keyword occurrences (re.escape guards against regex metacharacters)
        keyword_counts = {}
        for keyword in keywords:
            count = len(re.findall(r'\b' + re.escape(keyword) + r'\b', text_content))
            keyword_counts[keyword] = count

        # Store results
        snapshot_data.append({
            'timestamp': snapshot.timestamp,
            'date': datetime.strptime(snapshot.timestamp, '%Y%m%d%H%M%S').strftime('%Y-%m-%d'),
            'total_words': len(text_content.split()),
            **keyword_counts
        })

        processed_snapshots += 1

    except Exception as e:
        print(f"Error processing snapshot {snapshot.timestamp}: {e}")
        continue

# Create DataFrame for analysis
df = pd.DataFrame(snapshot_data)

if not df.empty:
    print(f"\nAnalyzed {len(df)} snapshots")
    print("\nKeyword frequency analysis:")
    print("-" * 30)

    # Calculate total keyword mentions
    keyword_totals = {keyword: df[keyword].sum() for keyword in keywords}
    sorted_keywords = sorted(keyword_totals.items(), key=lambda x: x[1], reverse=True)

    print("Total keyword mentions across all snapshots:")
    for keyword, count in sorted_keywords:
        print(f"  {keyword}: {count} mentions")

    # Find snapshots with highest keyword density
    print("\nSnapshots with highest keyword activity:")
    df['total_keywords'] = df[keywords].sum(axis=1)
    top_snapshots = df.nlargest(5, 'total_keywords')

    for _, row in top_snapshots.iterrows():
        print(f"  {row['date']}: {row['total_keywords']} total keyword mentions")
        active_keywords = [kw for kw in keywords if row[kw] > 0]
        print(f"    Active keywords: {', '.join(active_keywords)}")

    # Timeline analysis
    print("\nKeyword timeline (chronological order):")
    df_sorted = df.sort_values('date')
    for _, row in df_sorted.head(10).iterrows():
        active_kw = [f"{kw}({row[kw]})" for kw in keywords if row[kw] > 0]
        if active_kw:
            print(f"  {row['date']}: {', '.join(active_kw)}")

    print(f"\nExample 6 completed: Analyzed {len(df)} snapshots for keyword tracking")
    print("=" * 50)
else:
    print("No snapshots were successfully analyzed")
    print("=" * 50)
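
Because the snapshot index is returned oldest first, taking the first 20 results samples only the earliest years of the archive, where terms such as "brexit" or "covid" cannot occur. A simple way to spread the sample across time is to keep one capture per year. The sketch below does this on the client with the standard library. `one_per_year` is an illustrative name, and the timestamps are taken from the Example 1 output.

```python
from itertools import groupby

def one_per_year(timestamps):
    """Keep the earliest capture of each calendar year from 14-digit timestamps."""
    ordered = sorted(timestamps)
    # The first four characters of a Wayback timestamp are the year
    return [next(group) for _, group in groupby(ordered, key=lambda ts: ts[:4])]

timestamps = [
    "19961114070121",
    "19961228161552",
    "19970121091307",
    "19970310134836",
    "19981202165724",
    "19990125091746",
]
print(one_per_year(timestamps))
```

The CDX API also documents a `collapse` parameter (for example `collapse=timestamp:4`) that performs similar thinning on the server side. Check the API documentation before relying on it.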

## Example 7: Archive Availability and Gap Analysis

This example analyzes the availability of snapshots over time, identifies gaps in archival coverage, and provides insights into when a website might have been inaccessible or undergone significant changes.

In [None]:
# ============================================================================
# EXAMPLE 7: Archive Availability and Gap Analysis
# ============================================================================
# Analyze temporal coverage and identify gaps in archival data

from waybackpy import WaybackMachineCDXServerAPI
from datetime import datetime, timedelta
import pandas as pd

print("EXAMPLE 7: Archive Availability and Gap Analysis")
print("=" * 50)

# Initialize wayback API
url = "https://www.rte.ie"
user_agent = "Mozilla/5.0 (compatible; TemporalOSINTBot/1.0)"
wayback = WaybackMachineCDXServerAPI(url, user_agent)

print("Analyzing archive availability and gaps for RTÉ...")
print("=" * 50)

# Collect all snapshots with timestamps
snapshots = []
total_processed = 0

for snapshot in wayback.snapshots():
    total_processed += 1
    try:
        timestamp = datetime.strptime(snapshot.timestamp, '%Y%m%d%H%M%S')
        snapshots.append({
            'timestamp': timestamp,
            'year': timestamp.year,
            'month': timestamp.month,
            'status_code': snapshot.statuscode,
            'archive_url': snapshot.archive_url
        })
    except ValueError:
        continue  # Skip invalid timestamps

    # Limit processing for demo
    if total_processed >= 500:
        break

# Create DataFrame
df = pd.DataFrame(snapshots)

if not df.empty:
    # Sort by timestamp
    df = df.sort_values('timestamp')

    print(f"Total snapshots analyzed: {len(df)} (from {total_processed} processed)")
    print(f"Date range: {df['timestamp'].min().strftime('%Y-%m-%d')} to {df['timestamp'].max().strftime('%Y-%m-%d')}")
    print(f"Total years covered: {df['timestamp'].max().year - df['timestamp'].min().year + 1}")

    # Analyze by year
    yearly_counts = df.groupby('year').size().sort_index()
    print("\nSnapshots per year:")
    print("-" * 20)
    for year, count in yearly_counts.items():
        print(f"  {year}: {count} snapshots")

    # Find gaps (periods with no snapshots)
    print("\nAnalyzing gaps in coverage...")
    print("-" * 30)

    # Calculate time differences between consecutive snapshots
    df['time_diff'] = df['timestamp'].diff()

    # Find large gaps (> 30 days)
    large_gaps = df[df['time_diff'] > timedelta(days=30)]

    if not large_gaps.empty:
        print(f"Found {len(large_gaps)} significant gaps (>30 days):")
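        for _, row in large_gaps.head(10).iterrows():  # Show first 10 gaps
            # Derive the previous capture from the gap itself; using the index label
            # with iloc is unreliable once the DataFrame has been sorted
            prev_snapshot = row['timestamp'] - row['time_diff']
            gap_duration = row['time_diff'].days
            print(f"  Gap: {prev_snapshot.strftime('%Y-%m-%d')} to {row['timestamp'].strftime('%Y-%m-%d')} ({gap_duration} days)")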
    else:
        print("No significant gaps found (all gaps < 30 days)")

    # Monthly distribution analysis
    print("\nMonthly distribution (recent 3 years):")
    print("-" * 30)
    recent_years = df[df['year'] >= df['year'].max() - 2]

    monthly_counts = recent_years.groupby('month').size()
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

    for month_num in range(1, 13):
        count = monthly_counts.get(month_num, 0)
        print(f"  {months[month_num-1]}: {count} snapshots")

    # Status code analysis
    print("\nStatus code distribution:")
    print("-" * 30)
    status_counts = df['status_code'].value_counts()
    for status, count in status_counts.items():
        percentage = (count / len(df)) * 100
        print(f"  {status}: {count} snapshots ({percentage:.1f}%)")

    # Calculate archival frequency metrics (N snapshots span N-1 intervals)
    total_days = (df['timestamp'].max() - df['timestamp'].min()).days
    avg_frequency = total_days / (len(df) - 1) if len(df) > 1 else 0

    print(f"\nArchival frequency metrics:")
    print("-" * 30)
    print(f"  Average days between snapshots: {avg_frequency:.1f}")

    if not df['time_diff'].isna().all():
        longest_gap = df['time_diff'].max()
        print(f"  Longest gap: {longest_gap.days} days")
    else:
        print("  Longest gap: N/A")

    # Quality assessment
    successful_snapshots = len(df[df['status_code'] == '200'])
    success_rate = (successful_snapshots / len(df)) * 100

    print(f"\nArchive quality assessment:")
    print("-" * 30)
    print(f"  Successful snapshots (200 status): {successful_snapshots}/{len(df)} ({success_rate:.1f}%)")
    print(f"  Archive coverage quality: {'Excellent' if success_rate > 90 else 'Good' if success_rate > 80 else 'Fair' if success_rate > 70 else 'Poor'}")

    print(f"\nExample 7 completed: Analyzed {len(df)} snapshots for gaps and availability")
    print("=" * 50)

else:
    print("No snapshots found for analysis")
    print("=" * 50)
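
The core of the gap analysis is independent of pandas: sort the capture times, pair each one with its successor, and keep the pairs that are further apart than a threshold. A minimal standard-library sketch follows. `find_gaps` is an illustrative name and the dates are made-up.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(days=30)):
    """Return (previous, next, duration) for consecutive captures further apart than threshold."""
    ordered = sorted(timestamps)
    return [
        (earlier, later, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > threshold
    ]

# Made-up capture times for illustration
captures = [datetime(2023, 1, 1), datetime(2023, 1, 15), datetime(2023, 3, 1)]
for earlier, later, duration in find_gaps(captures):
    print(f"Gap: {earlier:%Y-%m-%d} to {later:%Y-%m-%d} ({duration.days} days)")
```

Pairing each capture with its successor keeps both ends of the gap together, which avoids having to look up the previous row by index.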

## Best Practices and Ethical Considerations

### Technical Best Practices

1. **Rate Limiting**: Always implement delays between requests to avoid overwhelming the Wayback Machine servers
2. **Error Handling**: Implement robust error handling for network issues, timeouts, and missing snapshots
3. **User Agent**: Use a descriptive user agent string to identify your research purpose
4. **Timeout Settings**: Set appropriate timeouts for HTTP requests to handle slow responses
5. **Data Validation**: Validate timestamps and URLs before processing
6. **Caching**: Consider caching results to avoid redundant API calls
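
Several of these practices (rate limiting, error handling with retries, and caching) can be combined in one small wrapper around whatever function actually performs the request. The sketch below is one possible shape, not a prescribed implementation. `make_polite_fetcher` and its parameters are hypothetical names, and the fetch function is injected so the example runs without network access.

```python
import time

def make_polite_fetcher(fetch, min_interval=1.0, max_retries=3,
                        sleep=time.sleep, clock=time.monotonic):
    """Wrap fetch(url) with an in-memory cache, a minimum delay, and retries."""
    cache = {}
    last_call = [None]  # time of the most recent request attempt

    def polite_fetch(url):
        if url in cache:
            return cache[url]  # cached: no request, no delay
        for attempt in range(max_retries):
            if last_call[0] is not None:
                wait = min_interval - (clock() - last_call[0])
                if wait > 0:
                    sleep(wait)  # enforce spacing between requests
            last_call[0] = clock()
            try:
                result = fetch(url)
            except Exception:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the error
                sleep(min_interval * 2 ** attempt)  # exponential backoff
                continue
            cache[url] = result
            return result

    return polite_fetch

# Demo with a fake fetch function that records calls instead of using the network
calls = []
def fake_fetch(url):
    calls.append(url)
    return f"body of {url}"

polite = make_polite_fetcher(fake_fetch, min_interval=0.0)
polite("https://www.rte.ie")
polite("https://www.rte.ie")  # served from cache
print(f"Requests actually made: {len(calls)}")
```

In the notebook, `fetch` could be something like `lambda u: requests.get(u, headers={'User-Agent': user_agent}, timeout=30).text`, with `min_interval` set to a second or more.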

### Ethical Considerations

1. **Respect Terms of Service**: Follow Internet Archive's terms of service and API guidelines
2. **Academic/Research Purpose**: Use these techniques for legitimate research, journalism, or academic purposes
3. **Privacy Considerations**: Be mindful of archived personal information and privacy implications
4. **Attribution**: Credit the Internet Archive and waybackpy library in your research
5. **Legal Compliance**: Ensure your research complies with local laws and regulations

### Common Use Cases for Temporal OSINT

- **Digital Forensics**: Investigating website changes during specific time periods
- **Brand Monitoring**: Tracking how organizations' online presence evolves
- **Disinformation Research**: Analyzing how false information spreads and evolves
- **Academic Research**: Studying digital culture and web evolution
- **Journalism**: Investigating claims about past statements or positions
- **Compliance**: Verifying historical compliance with regulations or policies

### Limitations and Considerations

1. **Archive Coverage**: Not all websites are equally well-archived
2. **Dynamic Content**: JavaScript-heavy sites may not be fully captured
3. **Robots.txt**: Some sites may block archiving
4. **Temporal Gaps**: Archives may have gaps during outages or restrictions
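5. **Content Changes**: Archived pages may differ from the original, for example because the Wayback Machine rewrites links and injects its own toolbar markup (visible in the Example 2 diff output)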
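
Since coverage varies from site to site, it can be worth checking whether a URL is archived at all before starting a larger analysis. The Internet Archive offers a JSON availability endpoint for this. The sketch below builds the query URL and parses a response without making a request. The helper names are illustrative, and the sample payload is hand-written in the documented response shape, not a recorded response.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build an availability lookup URL, optionally asking for the capture closest to timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"

def closest_snapshot(payload):
    """Return the 'closest' snapshot dict from a response, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None

print(availability_query("www.rte.ie", "20230601"))

# Hand-written sample payload (illustrative only)
sample = json.loads(
    '{"archived_snapshots": {"closest": {"available": true, '
    '"url": "http://web.archive.org/web/20230605155810/https://www.rte.ie/", '
    '"timestamp": "20230605155810", "status": "200"}}}'
)
print(closest_snapshot(sample)["timestamp"])
print(closest_snapshot({"archived_snapshots": {}}))
```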

---

## Conclusion

This notebook has demonstrated various techniques for conducting temporal OSINT analysis using the Wayback Machine and waybackpy library. The examples focused on RTÉ (www.rte.ie) and covered:

1. **Basic Snapshot Discovery** - Finding available snapshots
2. **Content Comparison** - Comparing snapshots across time periods
3. **Date Range Filtering** - Analyzing specific time periods
4. **Metadata Analysis** - Technical change detection
5. **Near-Time Comparison** - Current vs. historical analysis
6. **Keyword Tracking** - Monitoring specific topics over time
7. **Gap Analysis** - Understanding archival coverage patterns

These techniques provide a foundation for more advanced temporal OSINT investigations and can be adapted for various research purposes. Remember to always conduct your research ethically and in compliance with applicable laws and terms of service.

### Next Steps

- Experiment with different websites and time periods
- Combine multiple analytical techniques for comprehensive investigations
- Consider integrating with other OSINT tools and data sources
- Develop automated monitoring systems for ongoing analysis
- Share your findings responsibly with the appropriate communities

---

*Notebook created for educational and research purposes. Always use these techniques responsibly and ethically.*