
# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


## The Big Picture
Think of your Colab notebook as the entry point to your research repo.
The notebook does the work (collects data, logs metadata), while the repo (on GitHub) stores the evidence: code, data samples, metadata logs, and ethical documentation.

As a prerequisite, create an empty GitHub repo first. See the next cell for details.



## Create an empty GitHub repo (UI steps)
1. Sign in to GitHub.
2. Click the + (top-right) → New repository.
3. Repository name: e.g., csc786-ethics-demo.
4. Owner: your account.
5. Visibility: Public (recommended for this class) or Private.
6. Important: Do NOT check “Add a README”, “Add .gitignore”, or “Choose a license”. Leaving these unchecked keeps the repo truly empty, which makes the first push from Colab simplest.
7. Click Create repository.
8. On the next page, copy the HTTPS URL. You will use it later in the notebook.

## Create (or confirm) a GitHub Personal Access Token (PAT) for Colab pushes
You’ll push from Colab using HTTPS + a token (safer/simpler than SSH during class).
1. Go to https://github.com/settings → Developer settings → Personal access tokens. Choose “Fine-grained tokens” (preferred).
2. Click “Generate new token” and fill in:
   - Token name (e.g., colab-demo)
   - Only select repositories → choose the course repository
   - Permissions → Add permissions → Contents → Access: Read and write
3. Generate the token and copy it once (you won’t see it again).

Tip: keep this token handy just for the class; you can revoke it afterward.

## Setup Cell
Run once per session.

In [None]:
from getpass import getpass
import os
os.environ["GITHUB_TOKEN"] = getpass("Enter your GitHub PAT: ")

!git config --global user.name "Taiye03" # Display name; not necessarily your GitHub username
!git config --global user.email "olomutaiye03@gmail.com"

Enter your GitHub PAT: ··········


## One time only: Connect the empty repo from Colab (first push)

In [None]:
!git init
!git add .
!git commit -m "Initial reproducibility demo"
!git branch -M main

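# Replace the username (Taiye03) and the repo name with your own.
# The token is read from the GITHUB_TOKEN environment variable set in the Setup Cell.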

!git remote add origin https://Taiye03:$GITHUB_TOKEN@github.com/Taiye03/csc786-ethics-demo.git

!git push -u origin main

If everything is correct, you’ll see the push succeed and your files appear in the GitHub repo (refresh the repo page).
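
Because the remote URL above embeds the token, any command that prints the URL also prints the token. The following is a minimal sketch, not part of the original workflow, of removing the `user:token@` part of a URL before displaying or logging it. The helper name `strip_credentials` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Return the URL without any user:password@ part (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Rebuild the URL using only the host (and port), dropping the userinfo.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Illustrative URL only; "alice" and "secret" are made-up values.
print(strip_credentials("https://alice:secret@github.com/alice/demo.git"))
```

This only affects what is shown on screen; the token is still stored in the repo's local git configuration for as long as the remote URL contains it.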

In [None]:
%%bash
# --- Create and push .gitignore for clean, ethical repo ---

cat > .gitignore << 'EOF'
.ipynb_checkpoints/
__pycache__/
data/*
.env
*.env
EOF

git add .gitignore
git commit -m "Add .gitignore for data, cache, and secrets"
# Use the GITHUB_TOKEN environment variable for authentication
git push https://Taiye03:$GITHUB_TOKEN@github.com/Taiye03/csc786-ethics-demo.git

On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


Everything up-to-date


## When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content, instead of re-initializing a new one.

So, the reconnect workflow will look like this:

In [None]:
# 1. Clone your existing repo from GitHub
!git clone https://Taiye03:$GITHUB_TOKEN@github.com/Taiye03/csc786-ethics-demo.git # TODO: update the URL to your own repo
%cd csc786-ethics-demo


# 2. Optional: verify remote (this prints the full URL, including the embedded token, so clear the output before sharing)
!git remote -v


# 3. If you make changes and want to push again


!git add .
!git commit -m "Update from Colab session"
!git push
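
The choice between cloning and reusing an existing folder can be made explicit. This is a minimal sketch under the assumption that a folder containing a `.git` directory is already a clone; the function name `reconnect_command` and the placeholder URL are hypothetical, and the function only builds the command without running it.

```python
from pathlib import Path


def reconnect_command(repo_dir: Path, remote_url: str) -> list[str]:
    """Build the git command for a new session (hypothetical helper).

    If repo_dir already holds a clone, pull the latest changes;
    otherwise clone the remote into repo_dir.
    """
    if (repo_dir / ".git").is_dir():
        return ["git", "-C", str(repo_dir), "pull"]
    return ["git", "clone", remote_url, str(repo_dir)]


# Illustrative values; replace the URL with your own repo.
cmd = reconnect_command(Path("/content/csc786-ethics-demo"),
                        "https://github.com/<username>/csc786-ethics-demo.git")
print(cmd)
```

To actually run the command, it could be passed to `subprocess.run(cmd, check=True)`.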


In [None]:
# You can always check what's currently configured by:

!git config --global --list

user.name=Taiye03
user.email=olomutaiye03@gmail.com


## Colab-specific access details
Note: Everything under /content/ is temporary storage in Colab and is erased when the session ends, so only what you push to GitHub persists.
As you run the notebook:
1. It creates the folder /content/csc786-ethics-demo/data/ for your CSVs.
2. It appends provenance info to /content/csc786-ethics-demo/DATA_README.md.
3. You can add extra markdown files manually.

## Step 1 – Setup Environment

In [None]:
!pip install python-dotenv --quiet
import os, pandas as pd, requests, hashlib, json, sys, time
from datetime import datetime, timezone
from pathlib import Path

ROOT = Path("/content/csc786-ethics-demo")  # TODO: update if your repo name differs
DATA = ROOT / "data"
DATA.mkdir(parents=True, exist_ok=True)  # also creates ROOT if the repo has not been cloned yet
print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/csc786-ethics-demo/data



## Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  
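
One way to act on the last reminder is to scan files for token-like strings before committing. The sketch below is an illustration, not a complete secret scanner: it assumes classic GitHub tokens start with `ghp_` and fine-grained tokens start with `github_pat_`, and the names `TOKEN_PATTERN`, `find_token_like_strings`, and `scan_tree` are hypothetical.

```python
import re
from pathlib import Path

# Assumed token shapes: "ghp_" + 36 characters (classic) or "github_pat_" + a long tail (fine-grained).
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36}|github_pat_[A-Za-z0-9_]{20,}")


def find_token_like_strings(text: str) -> list[str]:
    """Return every substring of text that looks like a GitHub token."""
    return TOKEN_PATTERN.findall(text)


def scan_tree(root: Path, suffixes=(".py", ".md", ".ipynb", ".json", ".txt")) -> dict[str, int]:
    """Count token-like strings per file under root, skipping the .git folder."""
    hits = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes and ".git" not in path.parts:
            count = len(find_token_like_strings(path.read_text(errors="ignore")))
            if count:
                hits[str(path)] = count
    return hits
```

A call such as `scan_tree(Path("/content/csc786-ethics-demo"))` returning an empty dict means no matches were found for these two patterns; it does not prove the repo is free of other kinds of secrets.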


## Step 2 – Create Reproducibility Documentation Files

In [None]:
from pathlib import Path
ROOT = Path("/content/csc786-ethics-demo")

# 1 - README.md  (general project overview)
readme_text = """# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (update as necessary)
This project collects sample open data from ---- ,
logs all collection parameters and metadata, and stores them in a version-controlled repository.

## Files
| File | Purpose |
|------|----------|
| `README.md` | Project overview and usage instructions |
| `ETHICS.md` | Ethical statement for transparency |
| `DATA_README.md` | Auto-logged metadata for every data collection event |


"""
(ROOT / "README.md").write_text(readme_text)


# 2 - ETHICS.md  → ethical statement / responsible data use
ethics_text = """## Ethical Statement

- Data sources are open and public.
- No personally identifiable information (PII) is collected.
- All API usage complies with provider Terms of Service and rate limits.
- API keys (if required) are stored securely using environment variables.
- Every dataset generated is logged with parameters, timestamps, and hashes in `DATA_README.md`.
- This workflow aligns with academic integrity and reproducibility standards at Dakota State University.

- Potential risks (bias, privacy, security)
- Mitigations (data handling, bias checks)
- Limitations (known constraints)

---

"""
(ROOT / "ETHICS.md").write_text(ethics_text)


# 3 - DATA_README.md  → provenance log (append-only)
data_readme_path = ROOT / "DATA_README.md"
if not data_readme_path.exists():
    data_readme_path.write_text("""# Data Provenance Log
Each entry below documents a data-collection event.
Auto-generated by the notebook.

Example entry format (update to match your data):
- {"timestamp_utc": "...", "endpoint": "...", "params": {...}, "output": "...", "sha256": "..."}

---
""")

print("Created reproducibility files:")
!ls -lh /content/csc786-ethics-demo/*.md

Created reproducibility files:
-rw-r--r-- 1 root root 1.4K Oct 27 03:49 /content/csc786-ethics-demo/DATA_README.md
-rw-r--r-- 1 root root  596 Oct 27 03:59 /content/csc786-ethics-demo/ETHICS.md
-rw-r--r-- 1 root root  564 Oct 27 03:59 /content/csc786-ethics-demo/README.md


## Step 3 – Managing Secrets (Key-based API Example)

In [None]:

# Example using OpenWeatherMap (requires free key)
# Register: https://home.openweathermap.org/users/sign_up

# Store key securely in this Colab session
from getpass import getpass
import os

os.environ["OPENWEATHER_API_KEY"] = getpass("Enter your OpenWeatherMap API key: ")

API_KEY = os.getenv("OPENWEATHER_API_KEY")
print("Key loaded:", API_KEY[:6] + "****" if API_KEY else "No key found")


Enter your OpenWeatherMap API key: ··········
Key loaded: c113ba****
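
The confirmation above shows the first six characters of the key. A more conservative option, sketched here as an assumption rather than a requirement, is to reveal only a short tail and nothing at all for short values. The helper name `mask_secret` and its `visible` parameter are hypothetical.

```python
from typing import Optional


def mask_secret(value: Optional[str], visible: int = 4) -> str:
    """Return a display-safe version of a secret (hypothetical helper)."""
    if not value:
        return "No key found"
    # Very short values are masked entirely so the tail does not give them away.
    if len(value) <= visible * 2:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Made-up value for illustration; never paste a real key into a cell.
print("Key loaded:", mask_secret("abcdefghijkl"))
```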


### Example: Fetch Data Using OpenWeather API

In [None]:
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Sioux Falls", "appid": API_KEY, "units": "metric"}

r = requests.get(url, params=params, timeout=10)
r.raise_for_status()
data = r.json()

weather = {
    "city": data["name"],
    "temperature": data["main"]["temp"],
    "humidity": data["main"]["humidity"],
    "condition": data["weather"][0]["description"]
}
weather


{'city': 'Sioux Falls',
 'temperature': 11.58,
 'humidity': 73,
 'condition': 'broken clouds'}
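
If the `params` dict above were written to a provenance log as-is, the `appid` value would end up in a committed file. A minimal sketch of redacting secret-bearing parameters before logging follows; `redact_params` is a hypothetical helper, and every key name in `secret_keys` other than `appid` is an assumption about what other APIs might use.

```python
def redact_params(params: dict, secret_keys=("appid", "api_key", "key", "token")) -> dict:
    """Return a copy of params with secret values replaced by a placeholder."""
    return {
        name: ("<redacted>" if name.lower() in secret_keys else value)
        for name, value in params.items()
    }


# Made-up key value for illustration.
safe = redact_params({"q": "Sioux Falls", "appid": "not-a-real-key", "units": "metric"})
print(safe)
```

The original dict is left untouched, so it can still be used for the request itself.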

## Step 4 – Public API Example (Open-Meteo)

This example uses a public API that requires no key. For your own project, you will work with your own key-based API.

In [None]:
ENDPOINT = "https://api.open-meteo.com/v1/forecast"
PARAMS = {
    "latitude": 44.6,
    "longitude": -96.8,
    "hourly": "temperature_2m",
    "forecast_days": 1
}

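for attempt in range(3):
    try:
        r = requests.get(ENDPOINT, params=PARAMS, timeout=10)
        r.raise_for_status()
        break  # success: stop retrying
    except requests.exceptions.RequestException as e:
        wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
        print(f"Retrying in {wait}s due to: {e}")
        time.sleep(wait)
else:
    # Runs only if no attempt succeeded; avoids parsing a failed or missing response
    raise RuntimeError("Open-Meteo request failed after 3 attempts")

data = r.json()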

df = pd.DataFrame({
    "time": data["hourly"]["time"],
    "temperature_2m": data["hourly"]["temperature_2m"]
})
df.head()


               time  temperature_2m
0  2025-10-27T00:00            13.9
1  2025-10-27T01:00            12.2
2  2025-10-27T02:00            11.2
3  2025-10-27T03:00            11.0
4  2025-10-27T04:00            10.4
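
The retry loop in this step can be factored into a reusable helper. This is a sketch under stated assumptions, not the notebook's own method: `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be replaced in tests.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep, retry_on=(Exception,)):
    """Call func() until it succeeds, doubling the wait after each failure.

    Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            # Wait only if another attempt will follow.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"All {attempts} attempts failed") from last_error
```

With `requests`, a call could look like `retry_with_backoff(lambda: requests.get(ENDPOINT, params=PARAMS, timeout=10), retry_on=(requests.exceptions.RequestException,))`, followed by `raise_for_status()` on the result.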


## Step 5 – Save Data and Log Provenance

In [None]:
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
out_csv = DATA / f"hourly_temps_{timestamp}.csv"
df.to_csv(out_csv, index=False)

file_hash = hashlib.sha256(out_csv.read_bytes()).hexdigest()

meta = {
    "timestamp_utc": timestamp,
    "endpoint": ENDPOINT,
    "params": PARAMS,
    "output": out_csv.name,
    "sha256": file_hash,
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "requests": requests.__version__,
}

with open(ROOT / "DATA_README.md", "a") as f:
    f.write(f"\n- {json.dumps(meta)}")

print(f"Saved {out_csv.name}, hash={file_hash[:10]}…")
!tail -n 3 DATA_README.md
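
The logged hash is only useful if someone can check it later. The sketch below shows one way to re-verify a data file against its log entry, assuming each entry is a line starting with `- ` followed by JSON, as written by the cell above. The function names `read_provenance_entries` and `verify_entry` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def read_provenance_entries(log_path: Path) -> list[dict]:
    """Parse every '- {json}' line of the log, skipping lines that are not valid JSON."""
    entries = []
    for line in log_path.read_text().splitlines():
        if not line.startswith("- {"):
            continue
        try:
            entries.append(json.loads(line[2:]))
        except json.JSONDecodeError:
            continue  # e.g. an illustrative entry that uses "..." placeholders
    return entries


def verify_entry(entry: dict, data_dir: Path) -> bool:
    """Return True if the file named in the entry exists and matches its logged SHA-256."""
    path = data_dir / entry["output"]
    if not path.exists():
        return False
    return hashlib.sha256(path.read_bytes()).hexdigest() == entry["sha256"]
```

For example, `all(verify_entry(e, DATA) for e in read_provenance_entries(ROOT / "DATA_README.md"))` would report whether every logged file is present and unchanged.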


## Step 6 – Verify Before Pushing

You can verify everything before pushing.

In [None]:
!ls -lh /content
!ls -lh /content/csc786-ethics-demo
!head -n 5 README.md
!tail -n 5 DATA_README.md

total 8.0K
drwxr-xr-x 4 root root 4.0K Oct 27 03:59 csc786-ethics-demo
drwxr-xr-x 1 root root 4.0K Oct 23 13:40 sample_data
total 40K
-rw-r--r-- 1 root root  22K Oct 27 03:57 CSC786_Ethics_Demo_ST.ipynb
drwxr-xr-x 2 root root 4.0K Oct 27 03:59 data
-rw-r--r-- 1 root root 1.4K Oct 27 03:49 DATA_README.md
-rw-r--r-- 1 root root  596 Oct 27 03:59 ETHICS.md
-rw-r--r-- 1 root root  564 Oct 27 03:59 README.md
# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (udpate as necessary)
---

- {"timestamp_utc": "2025-10-27T021037Z", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": 44.6, "longitude": -96.8, "hourly": "temperature_2m", "forecast_days": 1}, "output": "hourly_temps_2025-10-27T021037Z.csv", "sha256": "5ef62bf2fe81a8cb04d10914d2d3f129d520877c005992cc66da03b36c481a62", "python": "3.12.12", "pandas": "2.2.2", "requests": "2.32.4"}
- {"timestamp_utc": "2025-10-

## Step 7 – Push to GitHub

In [None]:
!git remote set-url origin https://Taiye03:$GITHUB_TOKEN@github.com/Taiye03/csc786-ethics-demo.git

!git add .
!git commit -m "Update from Colab session"
!git push

## Step 8 – Wrap-Up & Reflection


### In this demo we:
- Accessed both key-based and open APIs ethically.  
- Created transparency files: README.md, ETHICS.md, DATA_README.md.  
- Logged complete metadata (endpoint, params, hash, timestamp).  
- Pushed the entire reproducible workflow to GitHub.  

### Now think:
- How could you adapt this structure for your own project?  
- What extra metadata might your discipline require (license, consent, citation)?  
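
As a starting point for the second question, extra fields can be merged into a provenance entry without overwriting the core ones. This is a sketch only: `with_extra_metadata` is a hypothetical helper, and the example field values are placeholders, not recommendations for any particular dataset.

```python
def with_extra_metadata(meta: dict, **extra) -> dict:
    """Return a new entry combining meta with extra fields, refusing to overwrite existing keys."""
    clashes = sorted(set(meta) & set(extra))
    if clashes:
        raise KeyError(f"Extra metadata would overwrite existing fields: {clashes}")
    return {**meta, **extra}


# Placeholder values for illustration only.
entry = with_extra_metadata(
    {"endpoint": "https://api.open-meteo.com/v1/forecast", "output": "example.csv"},
    license="<license of the data source>",
    citation="<how to cite the data source>",
)
print(entry)
```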

In [None]:
!ls -lha

total 16K
drwxr-xr-x 1 root root 4.0K Oct 27 03:48 .
drwxr-xr-x 1 root root 4.0K Oct 27 02:06 ..
drwxr-xr-x 4 root root 4.0K Oct 23 13:40 .config
drwxr-xr-x 1 root root 4.0K Oct 23 13:40 sample_data


In [None]:
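import json, pathlib
from google.colab import _message  # underscore-prefixed (private) Colab module; may change without notice

# Save the current notebook (as JSON) into the repo folder so it is versioned with the data.
TARGET = "/content/csc786-ethics-demo/CSC786_Ethics_Demo_ST.ipynb"
nb = _message.blocking_request('get_ipynb', timeout_sec=120)['ipynb']
path = pathlib.Path(TARGET)
path.write_text(json.dumps(nb, indent=2))
print("Saved:", path)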