
# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


## The Big Picture
Think of your Colab notebook as the entry point to your research repo.
The notebook does the work (collects data, logs metadata), while the repo (on GitHub) stores the evidence — code, data samples, metadata logs, and ethical documentation.

As a prerequisite, you need to create the GitHub repo first (empty). See the next cell for details.



## Create an empty GitHub repo (UI steps)
1. Sign in to GitHub.
2. Click the + (top-right) → New repository.
3. Repository name: e.g., csc786-ethics-demo.
4. Owner: your account.
5. Visibility: Public (recommended for this class) or Private.
6. Important: Do NOT check “Add a README”, “Add .gitignore”, or “Choose a license”. Leaving these unchecked keeps the repo truly empty, which makes the first push from Colab simplest.
7. Click Create repository.
8. On the next page, copy the HTTPS URL. You will it use it later in notebook.

# Create (or confirm) a GitHub Personal Access Token (PAT) for Colab pushes
You’ll push from Colab using HTTPS + a token (safer/simpler than SSH during class).
1. Go to https://github.com/settings → Developer settings → Personal access tokens. Choose “Fine-grained tokens” (preferred).
2. Generate new token
- Token name (e.g. colab-demo)
- Only select respositories -> choose course repository
- Permissions -> Add permissions -> Contents -> Access: Read and write  
3. Generate the token and copy it once (you won’t see it again).

Tip: keep this token handy just for the class; you can revoke it afterward.

# Setup Cell
Run once per session

In [23]:
%env GITHUB_TOKEN=entertoken

!git config --global user.name "Blake S" ## Display name not necessarily your username
!git config --global user.email "bschmiedty@gmail.com"

env: GITHUB_TOKEN=entertoken


# One time only: Connect the empty repo from Colab (first push)

In [None]:
!git init
!git add .
!git commit -m "Initial reproducibility demo"
!git branch -M main

# Replace <username> and <PAT> and repo name.

!git remote add origin https://BlakesterXD:$GITHUB_TOKEN@github.com/BlakesterXD/CSC786-Ethics-demo.git

!git push -u origin main

[33mhint: Using 'master' as the name for the initial branch. This default branch name[m
[33mhint: is subject to change. To configure the initial branch name to use in all[m
[33mhint: [m
[33mhint: 	git config --global init.defaultBranch <name>[m
[33mhint: [m
[33mhint: Names commonly chosen instead of 'master' are 'main', 'trunk' and[m
[33mhint: 'development'. The just-created branch can be renamed via this command:[m
[33mhint: [m
[33mhint: 	git branch -m <name>[m
Initialized empty Git repository in /content/.git/
[master (root-commit) 7e4804a] Initial reproducibility demo
 21 files changed, 51025 insertions(+)
 create mode 100644 .config/.last_opt_in_prompt.yaml
 create mode 100644 .config/.last_survey_prompt.yaml
 create mode 100644 .config/.last_update_check.json
 create mode 100644 .config/active_config
 create mode 100644 .config/config_sentinel
 create mode 100644 .config/configurations/config_default
 create mode 100644 .config/default_configs.db
 create mode 100

If everything is correct, you’ll see the push succeed and your files appear in the GitHub repo (refresh the repo page).

In [None]:
%%bash
# --- Create and push .gitignore for clean, ethical repo ---

cat > .gitignore << 'EOF'
.ipynb_checkpoints/
__pycache__/
data/*
.env
*.env
EOF

git add .gitignore
git commit -m "Add .gitignore for data, cache, and secrets"
git push


[main 28874bf] Add .gitignore for data, cache, and secrets
 1 file changed, 5 insertions(+)
 create mode 100644 .gitignore


remote: This repository moved. Please use the new location:        
remote:   https://github.com/BlakesterXD/CSC786-Ethics-Demo.git        
To https://github.com/BlakesterXD/CSC786-Ethics-demo.git
   7e4804a..28874bf  main -> main


# When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content, instead of re-initializing a new one.

So, the reconnect workflow will look like this:

In [2]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/BlakesterXD/CSC786-Ethics-demo.git
%cd CSC786-Ethics-Demo


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
# !git remote set-url origin https://BlakesterXD:$GITHUB_TOKEN@github.com/BlakesterXD/CSC786-Ethics-demo.git

# !git add .
# !git commit -m "Update from Colab session"
# !git push


Cloning into 'CSC786-Ethics-demo'...
remote: Enumerating objects: 46, done.[K
remote: Counting objects: 100% (46/46), done.[K
remote: Compressing objects: 100% (33/33), done.[K
remote: Total 46 (delta 9), reused 43 (delta 6), pack-reused 0 (from 0)[K
Receiving objects: 100% (46/46), 8.42 MiB | 16.71 MiB/s, done.
Resolving deltas: 100% (9/9), done.
[Errno 2] No such file or directory: 'CSC786-Ethics-Demo'
/content
fatal: not a git repository (or any of the parent directories): .git


In [5]:
# You can always check what's currently configured by:

!git config --global --list

user.name=Blake S
user.email=bschmiedty@gmail.com


## Colab-specific access details
Note: While we work in Colab, everything inside /content/ is a temporary mini-repo.
As you run the notebook:
1. It creates the folder /content/data/ for your CSVs.
2. It appends provenance info into /content/DATA_README.md.
3. You can add extra markdown files manually.

## Step 1 – Setup Environment

In [7]:
!pip install python-dotenv --quiet
import os, pandas as pd, requests, hashlib, json, sys, time
from datetime import datetime, timezone
from pathlib import Path

ROOT = Path("/content/CSC786-Ethics-demo") ## todo: may update repo name if needed
DATA = ROOT / "data"
ROOT.mkdir(exist_ok=True)
DATA.mkdir(exist_ok=True)
print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/CSC786-Ethics-demo/data



# Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  


## Step 2 – Create Reproducibility Documentation Files

In [54]:
from pathlib import Path
ROOT = Path("/content/CSC786-Ethics-demo")

# 1 - README.md  (general project overview)
readme_text = """# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (udpate as necessary)
This project collects sample open data from ---- ,
logs all collection parameters and metadata, and stores them in a version-controlled repository.

## Files
| File | Purpose |
|------|----------|
| `README.md` | Project overview and usage instructions |
| `ETHICS.md` | Ethical statement for transparency |
| `DATA_README.md` | Auto-logged metadata for every data collection event |


"""
(ROOT / "README.md").write_text(readme_text)


# 2 - ETHICS.md  → ethical statement / responsible data use
ethics_text = """## Ethical Statement

- Data sources are open and public.
- No personally identifiable information (PII) is collected.
- All API usage complies with provider Terms of Service and rate limits.
- API keys (if required) are stored securely using environment variables.
- Every dataset generated is logged with parameters, timestamps, and hashes in `DATA_README.md`.
- This workflow aligns with academic integrity and reproducibility standards at Dakota State University.

- Potential risks (bias, privacy, security)
- Mitigations (data handling, bias checks)
- Limitations (known constraints)

---

"""
(ROOT / "ETHICS.md").write_text(ethics_text)


# 3 - DATA_README.md  → provenance log (append-only)
data_readme_path = ROOT / "DATA_README.md"
if not data_readme_path.exists():
    data_readme_path.write_text("""# Data Provenance Log
Each entry below documents a data-collection event.
Auto-generated by the notebook.

Example entry format (udpate to match your data):
- {"timestamp_utc": "...", "endpoint": "...", "params": {...}, "output": "...", "sha256": "..."}

---
""")

print("Created reproducibility files:")
!ls -lh /content | grep .md

Created reproducibility files:


## Step 3 – Managing Secrets (Key-based API Example)

In [24]:

# Example using OpenWeatherMap (requires free key)
# Register: https://home.openweathermap.org/users/sign_up

# Store key securely in this Colab session
%env OPENWEATHER_API_KEY=1234 # update this string

API_KEY = os.getenv("OPENWEATHER_API_KEY")
print("Key loaded:", API_KEY[:6] + "****" if API_KEY else "No key found")


env: OPENWEATHER_API_KEY=1234 # update this string
Key loaded: 1234 #****


### Example: Fetch Data Using OpenWeather API

In [9]:
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Sioux Falls", "appid": API_KEY, "units": "metric"}

r = requests.get(url, params=params, timeout=10)
r.raise_for_status()
data = r.json()

weather = {
    "city": data["name"],
    "temperature": data["main"]["temp"],
    "humidity": data["main"]["humidity"],
    "condition": data["weather"][0]["description"]
}
weather


{'city': 'Sioux Falls',
 'temperature': 13.19,
 'humidity': 51,
 'condition': 'few clouds'}

## Step 4 – Public API Example (Open-Meteo)

You will work with your own Key-based API.

In [10]:
ENDPOINT = "https://api.open-meteo.com/v1/forecast"
PARAMS = {
    "latitude": 44.6,
    "longitude": -96.8,
    "hourly": "temperature_2m",
    "forecast_days": 1
}

for attempt in range(3):
    try:
        r = requests.get(ENDPOINT, params=PARAMS, timeout=10)
        r.raise_for_status()
        break
    except requests.exceptions.RequestException as e:
        wait = 2 ** attempt
        print(f"Retrying in {wait}s due to: {e}")
        time.sleep(wait)

data = r.json()

df = pd.DataFrame({
    "time": data["hourly"]["time"],
    "temperature_2m": data["hourly"]["temperature_2m"]
})
df.head()


Unnamed: 0,time,temperature_2m
0,2025-10-23T00:00,7.5
1,2025-10-23T01:00,5.7
2,2025-10-23T02:00,4.9
3,2025-10-23T03:00,4.6
4,2025-10-23T04:00,3.5


## Step 5 – Save Data and Log Provenance

In [11]:
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
out_csv = DATA / f"hourly_temps_{timestamp}.csv"
df.to_csv(out_csv, index=False)

file_hash = hashlib.sha256(out_csv.read_bytes()).hexdigest()

meta = {
    "timestamp_utc": timestamp,
    "endpoint": ENDPOINT,
    "params": PARAMS,
    "output": out_csv.name,
    "sha256": file_hash,
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "requests": requests.__version__,
}

with open(ROOT / "DATA_README.md", "a") as f:
    f.write(f"\n- {json.dumps(meta)}")

print(f"Saved {out_csv.name}, hash={file_hash[:10]}…")
!tail -n 3 /content/CSC786-Ethics-demo/DATA_README.md


Saved hourly_temps_2025-10-23T181933Z.csv, hash=1fee45f7cc…

- {"timestamp_utc": "2025-10-23T181933Z", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": 44.6, "longitude": -96.8, "hourly": "temperature_2m", "forecast_days": 1}, "output": "hourly_temps_2025-10-23T181933Z.csv", "sha256": "1fee45f7ccbc1276cbb65ad4c07845fce46f04fcbbb4b9ad155cc3ce4c8f0296", "python": "3.12.12", "pandas": "2.2.2", "requests": "2.32.4"}

You can veryify everything before pushing.

In [14]:
!ls -lh /content/
!head -n 5 /content/CSC786-Ethics-demo/CSC786-Ethics-demo/DATA_README.md
!tail -n 5 /content/CSC786-Ethics-demo/CSC786-Ethics-demo/DATA_README.md

total 8.0K
drwxr-xr-x 7 root root 4.0K Oct 23 18:19 CSC786-Ethics-demo
drwxr-xr-x 1 root root 4.0K Oct 22 13:39 sample_data
# Data Provenance Log
Each entry below documents a data-collection event.
Auto-generated by the notebook.

Example entry format (udpate to match your data):
---

- {"timestamp_utc": "2025-10-21T190729Z", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": 44.6, "longitude": -96.8, "hourly": "temperature_2m", "forecast_days": 1}, "output": "hourly_temps_2025-10-21T190729Z.csv", "sha256": "979ce22b7fea00c2faae00e74f2283cf134ac44acfa3350366017ad7fa5c7a3b", "python": "3.12.12", "pandas": "2.2.2", "requests": "2.32.4"}
- {"timestamp_utc": "2025-10-21T190809Z", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": 44.6, "longitude": -96.8, "hourly": "temperature_2m", "forecast_days": 1}, "output": "hourly_temps_2025-10-21T190809Z.csv", "sha256": "979ce22b7fea00c2faae00e74f2283cf134ac44acfa3350366017ad7fa5c7a3b", "python": 

## Step 7 – Push to GitHub

In [22]:
!git remote set-url origin https://BlakesterXD:$GITHUB_TOKEN@github.com/BlakesterXD/CSC786-Ethics-Demo.git

!git add .
!git commit -m "add commit message here"
!git push

fatal: pathspec '/content/CSC786-Ethics-demo/DATA.README' did not match any files
[main 0530f59] removed accidental adding
 1 file changed, 2 deletions(-)
 delete mode 100644 DATA_README.md
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 2 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 234 bytes | 234.00 KiB/s, done.
Total 2 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.[K
To https://github.com/BlakesterXD/CSC786-Ethics-Demo.git
   1baa3d3..0530f59  main -> main


## Step 8 – Wrap-Up & Reflection


### In this demo we:
- Accessed both key-based and open APIs ethically.  
- Created transparency files: README.md, ETHICS.md, DATA_README.md.  
- Logged complete metadata (endpoint, params, hash, timestamp).  
- Pushed the entire reproducible workflow to GitHub.  

### Now think:
- How could you adapt this structure for your own project?  
- What extra metadata might your discipline require (license, consent, citation)?  