# Introduction

To understand the data present in `search_history.json`, we:
- Wrote `extract_schema` and `merge_schemas`, a map function and a reduce function that follow the MapReduce pattern to batch-process
  `search_history.json` entries in parallel and extract the schema fields.
- Checked for empty or null values: nulls, empty lists, empty dictionaries, empty strings, and whitespace-only strings.
- Randomly sampled entries with and without each optional field to infer what the field does, as we found no online documentation
  describing these fields.
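
As a rough illustration of the map and reduce steps described above, the sketch below counts key paths over three made-up entries. The entries and the helper names `key_paths` and `count_paths` are hypothetical and deliberately simplified (lists are not descended into); the notebook's own `extract_schema` and `merge_schemas` are more complete.

```python
from collections import Counter
from functools import reduce


def key_paths(entry, prefix=""):
    """Map step: return the set of dotted key paths present in one entry."""
    paths = set()
    for key, value in entry.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.add(path)
        if isinstance(value, dict):
            paths |= key_paths(value, path)
    return paths


def count_paths(entries):
    """Reduce step: sum, per key path, how many entries contain it."""
    return reduce(
        lambda acc, paths: acc + Counter(paths),
        map(key_paths, entries),
        Counter(),
    )


# Made-up entries, not taken from search_history.json.
toy_entries = [
    {"title": "Searched for a", "time": "2020-01-01T00:00:00Z", "titleUrl": "https://example.com/a"},
    {"title": "Used Search", "time": "2020-01-02T00:00:00Z"},
    {"title": "Visited b", "time": "2020-01-03T00:00:00Z", "titleUrl": "https://example.com/b"},
]

counts = count_paths(toy_entries)
total = len(toy_entries)
# A path is treated as mandatory when every entry contains it.
mandatory = sorted(p for p, c in counts.items() if c == total)
optional = sorted(p for p, c in counts.items() if c < total)
print(mandatory)  # ['time', 'title']
print(optional)   # ['titleUrl']
```

A field is classified as mandatory only when its count equals the number of entries, which is the same rule the notebook applies to the full export.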

The search history data we are given in `search_history.json` is a Google Takeout export of search activity containing 55,383 entries
accumulated over 7 years, from June 2017 to June 2024, with no empty or null values. The lists below describe the schema and the meaning of its mandatory and optional fields.

**Mandatory Fields**

- `header` (str): The Google product category. Only "Search" was observed across all entries.
- `title` (str): Description of the activity. The following templates have been observed:

  | Template                   | Meaning                                                    |
  |----------------------------|------------------------------------------------------------|
  | "Searched for ..."         | User ran a Google search                                   |
  | "Visited ..."              | User visited a website from search results                 |
  | "Viewed ..."               | User clicked on a Google Maps entry from search results    |
  | "1 notification"           | User received a notification from Google Alerts            |
  | "Used Search"              | Unknown - appears sparsely (170 times)                     |
  | "Ran internet speed test"  | Unknown - appears only once                                |

  Here, the "..." is either a search string or a URL.
- `time` (str): ISO 8601 timestamp of when the activity occurred. In the data, timestamps range from 2017-06-08 16:42:55.223000+00:00 to
  2024-06-23 22:21:50.431000+00:00, a span of 2,572 days (about 7.0 years).
- `products` (list): Google products involved. Only ["Search"] was observed across all entries.
- `activityControls` (list): Which activity control settings captured this data. Only ["Web & App Activity"] was observed across all entries.
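
The template counts quoted in the table above could plausibly be reproduced with a prefix match on `title`. The following is a minimal sketch under that assumption; `TEMPLATE_PREFIXES`, `classify_title`, and the sample titles are hypothetical and are not part of the notebook.

```python
from collections import Counter

# Prefixes taken from the template table; fixed-text templates are listed in full.
TEMPLATE_PREFIXES = (
    "Searched for ",
    "Visited ",
    "Viewed ",
    "1 notification",
    "Used Search",
    "Ran internet speed test",
)


def classify_title(title):
    """Return the first template prefix that the title starts with, else 'other'."""
    for prefix in TEMPLATE_PREFIXES:
        if title.startswith(prefix):
            return prefix.strip()
    return "other"


# Made-up titles for illustration only.
sample_titles = [
    "Searched for python match statement",
    "Visited https://example.com/",
    "Used Search",
    "Searched for iso 8601",
    "Something unexpected",
]

template_counts = Counter(classify_title(t) for t in sample_titles)
for template, count in template_counts.most_common():
    print(f"{template}: {count}")
```

Keeping an explicit "other" bucket makes it easy to spot titles that match none of the observed templates.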

**Optional Fields**

- `titleUrl` (str, 55,021 / 99.3%): The URL associated with the activity. In the data, it only appeared for the `title` templates "Searched
  for ..." and "Visited ...".
- `details` (list, 2,794 / 5.0%): Additional details about the activity. In the data, when the field was present, the only key recorded for its items was:
  - `details[].name` (str): Name/description within details
- `locationInfos` (list, 226 / 0.4%): Location data when search was made
  - `locationInfos[].name` (str): Location description (e.g., "At this general area")
  - `locationInfos[].url` (str): Google Maps link to the location
  - `locationInfos[].source` (str): Source of location (e.g., "From your places (Home)")
  - `locationInfos[].sourceUrl` (str, 225 / 0.4%): Help URL explaining the location source
- `subtitles` (list, 166 / 0.3%): Additional context like notification topics
  - `subtitles[].name` (str): Subtitle text (e.g., topic names like "Reuters")
  - `subtitles[].url` (str, 8 / <0.1%): Optional URL within subtitle
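
The schema listed above can also be written down as type hints. The sketch below is one possible encoding using `typing.TypedDict`; the class names and the `missing_mandatory` helper are hypothetical, and the item types include only the keys observed in this export.

```python
from typing import TypedDict


class DetailItem(TypedDict):
    name: str


class _LocationRequired(TypedDict):
    name: str
    url: str
    source: str


class LocationInfo(_LocationRequired, total=False):
    sourceUrl: str  # optional: seen in 225 of the 226 entries with locationInfos


class _SubtitleRequired(TypedDict):
    name: str


class Subtitle(_SubtitleRequired, total=False):
    url: str  # optional: seen in 8 entries


class _EntryRequired(TypedDict):
    header: str
    title: str
    time: str
    products: list[str]
    activityControls: list[str]


class SearchEntry(_EntryRequired, total=False):
    titleUrl: str
    details: list[DetailItem]
    locationInfos: list[LocationInfo]
    subtitles: list[Subtitle]


def missing_mandatory(entry):
    """Return the mandatory keys absent from a raw entry dict, sorted."""
    return sorted(SearchEntry.__required_keys__ - set(entry))


# Made-up entry for illustration only.
entry = {"header": "Search", "title": "Used Search", "time": "2020-01-01T00:00:00Z"}
print(missing_mandatory(entry))  # ['activityControls', 'products']
```

Splitting each type into a required base class and a `total=False` subclass keeps mandatory and optional keys distinct without relying on newer typing features.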

In [1]:
import json
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def extract_schema(obj, prefix=""):
    """Map: extract all key paths and value types from one entry."""
    paths = {}
    match obj:
        case dict(d):
            for k, v in d.items():
                path = f"{prefix}.{k}" if prefix else k
                paths[path] = {"types": {type(v).__name__}, "count": 1}
                paths.update(extract_schema(v, path))
        case list(items) if items:
            # "path[]" marks a non-empty list. Only its first element is inspected,
            # so keys that appear only in later elements are not recorded.
            paths[f"{prefix}[]"] = {"types": {"list"}, "count": 1}
            paths.update(extract_schema(items[0], f"{prefix}[]"))
        case list():
            # Empty list: record the marker; there is nothing to descend into.
            paths[f"{prefix}[]"] = {"types": {"list"}, "count": 1}
    return paths


def merge_schemas(a, b):
    """Reduce: merge two schemas, combining types and counts."""
    result = dict(a)
    for k, v in b.items():
        if k in result:
            result[k] = {
                "types": result[k]["types"] | v["types"],
                "count": result[k]["count"] + v["count"],
            }
        else:
            result[k] = v
    return result


def find_empty_values(obj, prefix=""):
    """Map: find all empty/null values using structural pattern matching."""
    issues = {}
    match obj:
        case None:
            issues[f"{prefix} (null)"] = 1
        case "":
            issues[f"{prefix} (empty string)"] = 1
        case str(s) if s.isspace():
            issues[f"{prefix} (whitespace only)"] = 1
        case list() if not obj:
            issues[f"{prefix} (empty list)"] = 1
        case dict() if not obj:
            issues[f"{prefix} (empty dict)"] = 1
        case dict(d):
            for k, v in d.items():
                path = f"{prefix}.{k}" if prefix else k
                nested = find_empty_values(v, path)
                for p, count in nested.items():
                    issues[p] = issues.get(p, 0) + count
        case list(items):
            for item in items:
                nested = find_empty_values(item, f"{prefix}[]")
                for p, count in nested.items():
                    issues[p] = issues.get(p, 0) + count
        case _:
            pass  # Non-empty primitive value
    return issues


def merge_issues(a, b):
    """Reduce: merge two issue dicts, summing counts."""
    result = dict(a)
    for k, v in b.items():
        result[k] = result.get(k, 0) + v
    return result


if __name__ == "__main__":
    with open("./search_history.json", "r") as f:
        data = json.load(f)

    total = len(data)
    print(f"Processing {total} entries...\n")

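    # Map step. Note: in standard (GIL) CPython builds, threads run this CPU-bound
    # work concurrently rather than truly in parallel.
    with ThreadPoolExecutor() as executor:
        schemas = list(executor.map(extract_schema, data))
        empty_values = list(executor.map(find_empty_values, data))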

    # Schema analysis
    full_schema = reduce(merge_schemas, schemas, {})

    mandatory = []
    optional = []

    for path in sorted(full_schema.keys()):
        info = full_schema[path]
        types = ", ".join(sorted(info["types"]))
        count = info["count"]

        if count == total:
            mandatory.append((path, types))
        else:
            optional.append((path, types, count))

    print("=== MANDATORY FIELDS (100%) ===")
    for path, types in mandatory:
        print(f"  {path}: {types}")

    print("\n=== OPTIONAL FIELDS ===")
    for path, types, count in optional:
        print(f"  {path}: {types} ({count}/{total})")

    # Empty values analysis
    combined = reduce(merge_issues, empty_values, {})

    if combined:
        print("\n=== EMPTY/NULL VALUES FOUND ===")
        for path, count in sorted(combined.items(), key=lambda x: -x[1]):
            print(f"  {path}: {count} occurrences")
    else:
        print("\nNo empty or null values found.")

Processing 55383 entries...

=== MANDATORY FIELDS (100%) ===
  activityControls: list
  activityControls[]: list
  header: str
  products: list
  products[]: list
  time: str
  title: str

=== OPTIONAL FIELDS ===
  details: list (2794/55383)
  details[]: list (2794/55383)
  details[].name: str (2794/55383)
  locationInfos: list (226/55383)
  locationInfos[]: list (226/55383)
  locationInfos[].name: str (226/55383)
  locationInfos[].source: str (226/55383)
  locationInfos[].sourceUrl: str (225/55383)
  locationInfos[].url: str (226/55383)
  subtitles: list (166/55383)
  subtitles[]: list (166/55383)
  subtitles[].name: str (166/55383)
  subtitles[].url: str (8/55383)
  titleUrl: str (55021/55383)

No empty or null values found.


In [None]:
import json
import random

OPTIONAL_FIELDS = ["titleUrl", "details", "locationInfos", "subtitles"]
SAMPLE_SIZE = 5


def sample_by_field(data, field, n=SAMPLE_SIZE):
    """Get random sample of entries containing the specified field."""
    with_field = [e for e in data if field in e]
    without_field = [e for e in data if field not in e]
    return {
        "with": random.sample(with_field, min(n, len(with_field))),
        "without": random.sample(without_field, min(n, len(without_field))),
    }


def format_samples(samples, field):
    """Format samples for a field as a single string."""
    with_entries = "\n".join(
        f"""
Sample {i}:
  title: {entry.get('title', 'N/A')}
  {field}: {entry.get(field)}"""
        for i, entry in enumerate(samples["with"], 1)
    )

    without_entries = "\n".join(
        f"""
Sample {i}:
  title: {entry.get('title', 'N/A')}"""
        for i, entry in enumerate(samples["without"], 1)
    )

    return f"""
{'='*60}
FIELD: {field}
{'='*60}

--- Entries WITH {field} ---
{with_entries}

--- Entries WITHOUT {field} ---
{without_entries}
"""


if __name__ == "__main__":
    with open("./search_history.json", "r") as f:
        data = json.load(f)

    for field in OPTIONAL_FIELDS:
        samples = sample_by_field(data, field)
        print(format_samples(samples, field))

In [1]:
import json
from collections import Counter
from datetime import datetime

CATEGORICAL_FIELDS = ["header", "products", "activityControls"]


def summarise_categorical(data, field):
    """Count occurrences of each unique value for a categorical field."""
    values = []
    for entry in data:
        val = entry.get(field)
        if isinstance(val, list):
            values.extend(val)
        else:
            values.append(val)
    return Counter(values)


def summarise_time(data):
    """Find min and max timestamps."""
    times = [datetime.fromisoformat(e["time"].replace("Z", "+00:00")) for e in data]
    return min(times), max(times)


if __name__ == "__main__":
    with open("./search_history.json", "r") as f:
        data = json.load(f)

    for field in CATEGORICAL_FIELDS:
        counts = summarise_categorical(data, field)
        print(f"""
{'='*60}
FIELD: {field} ({len(counts)} unique values)
{'='*60}
{chr(10).join(f'  {val}: {count}' for val, count in counts.most_common())}
""")

    min_time, max_time = summarise_time(data)
    duration = max_time - min_time
    print(f"""
{'='*60}
FIELD: time
{'='*60}
  Min: {min_time}
  Max: {max_time}
  Duration: {duration.days} days ({duration.days / 365:.1f} years)
""")


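============================================================
FIELD: header (1 unique values)
============================================================
  Search: 55383


============================================================
FIELD: products (1 unique values)
============================================================
  Search: 55383


============================================================
FIELD: activityControls (1 unique values)
============================================================
  Web & App Activity: 55383


============================================================
FIELD: time
============================================================
  Min: 2017-06-08 16:42:55.223000+00:00
  Max: 2024-06-23 22:21:50.431000+00:00
  Duration: 2572 days (7.0 years)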

