<a href="https://colab.research.google.com/github/MagicPolygon/spotify-data-analysis/blob/main/notebooks/anonymise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anonymising the data

The Extended Streaming History data that Spotify gave me includes IP addresses, which I want to keep private. This notebook attempts to anonymise that data by creating a copy of each original file with the IP addresses removed.
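
As a rough illustration of the idea, the sketch below removes the IP address from a single record. The record is made up: only the `ip_addr` key comes from this notebook, while the other field names and all of the values are placeholder assumptions. `strip_ip` is a hypothetical helper used only for this example.

```python
# Hypothetical record: only "ip_addr" is taken from this notebook,
# the other field names and all values are placeholders.
record = {
    "ts": "2020-01-01T00:00:00Z",
    "ms_played": 1000,
    "ip_addr": "203.0.113.7",    # Address from a range reserved for documentation
}


def strip_ip(rec):
    """Return a copy of rec without the "ip_addr" key."""
    cleaned = dict(rec)             # Shallow copy, so the original record is untouched
    cleaned.pop("ip_addr", None)    # No error if the key is missing
    return cleaned


cleaned_record = strip_ip(record)
print(cleaned_record)
```

The original record keeps its `ip_addr` key here because the helper works on a copy, which mirrors the goal of leaving the raw files as they are.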

## Imports and Mounting

To begin, I will import everything I need and mount Google Drive so that I can access the files there:

In [None]:
import json
from pathlib import Path    # Allows file paths to be handled safely
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Paths

I will now define the paths to the raw data folder and to the folder where the anonymised copies will be saved:

In [None]:
raw_dir = Path("/content/drive/MyDrive/spotify-data-analysis/data/raw")
anonymised_dir = Path("/content/drive/MyDrive/spotify-data-analysis/data/anonymised")
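
Writing into a folder that doesn't exist raises an error, so it may be worth making sure the output folder is there before saving anything. This is a minimal sketch that uses a temporary folder in place of Drive so that it runs anywhere. `prepare_output_dir` and `demo_anonymised_dir` are hypothetical names, not part of the notebook.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(path):
    """Create the folder (and any missing parent folders) if it isn't there yet."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)    # exist_ok avoids an error if it already exists
    return path


# Stand-in for the Drive folders, so the sketch doesn't need Drive mounted
base = Path(tempfile.mkdtemp())
demo_anonymised_dir = prepare_output_dir(base / "data" / "anonymised")
print(demo_anonymised_dir.is_dir())
```

In this notebook, the equivalent call would be `prepare_output_dir(anonymised_dir)`.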

## Anonymising the JSON files

For each file in the raw data folder, I will load it, remove the sensitive information, and save the result in the anonymised data folder:

In [None]:
for json_file in raw_dir.glob("*.json"):    # For each JSON file in the raw data folder
    with open(json_file, "r", encoding="utf-8") as f:    # Read as UTF-8 so non-ASCII characters are decoded correctly
        data = json.load(f)

    for record in data:
        record.pop("ip_addr", None)    # "None" makes it so no error is raised if "ip_addr" isn't found, avoiding crashes

    output_path = anonymised_dir / json_file.name    # "/" is not division here; Path overloads it to join the folder and the file name

    # Write the JSON back out; ensure_ascii=False keeps non-ASCII characters as they are instead of escaping them
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Cleaned: {json_file.name}")



Cleaned: Streaming_History_Audio_2020_1.json
Cleaned: Streaming_History_Audio_2020_2.json
Cleaned: Streaming_History_Audio_2021-2022_6.json
Cleaned: Streaming_History_Audio_2017-2020_0.json
Cleaned: Streaming_History_Audio_2020-2021_3.json
Cleaned: Streaming_History_Audio_2023_9.json
Cleaned: Streaming_History_Audio_2021_5.json
Cleaned: Streaming_History_Audio_2022_7.json
Cleaned: Streaming_History_Audio_2021_4.json
Cleaned: Streaming_History_Audio_2022-2023_8.json
Cleaned: Streaming_History_Audio_2023-2024_10.json
Cleaned: Streaming_History_Audio_2024-2025_12.json
Cleaned: Streaming_History_Audio_2024_11.json
Cleaned: Streaming_History_Video_2020-2025.json
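
To double-check the result, the anonymised files can be scanned for any leftover `ip_addr` keys. This is a minimal sketch, assuming each file holds a list of records as in the loop above. `files_with_key` is a hypothetical helper, and the demo uses a temporary folder with made-up records rather than the real data.

```python
import json
import tempfile
from pathlib import Path


def files_with_key(folder, key="ip_addr"):
    """Return the names of the JSON files in folder where any record still has key."""
    leftovers = []
    for json_file in sorted(Path(folder).glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            records = json.load(f)
        if any(key in record for record in records):
            leftovers.append(json_file.name)
    return leftovers


# Made-up demo files: one clean, one that still contains an IP address
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "clean.json").write_text(
    json.dumps([{"ts": "2020-01-01T00:00:00Z"}]), encoding="utf-8"
)
(demo_dir / "leaky.json").write_text(
    json.dumps([{"ip_addr": "203.0.113.7"}]), encoding="utf-8"
)

print(files_with_key(demo_dir))
```

Calling `files_with_key(anonymised_dir)` on the real output should return an empty list if every `ip_addr` key was removed.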
