# The Data Scientist's Quest
## A Data Scientist's Guide to Syft-Client

```
╔══════════════════════════════════════════════════════════════════╗
║                    THE SYFT DANCE                                ║
║            Data Scientist (DS) Notebook                          ║
║                                                                  ║
║  This notebook is Part 2 of a 2-part collaboration demo.        ║
║  Run this alongside: DO_Journey.ipynb                           ║
╚══════════════════════════════════════════════════════════════════╝
```

### What You'll Do:
1. **Setup** - Install and authenticate
2. **Connect** - Add the Data Owner as a peer
3. **Explore** - Discover and understand available datasets
4. **Analyze** - Write and submit analysis code
5. **Results** - Retrieve computed results

### Prerequisites:
- Google account with Google Drive
- A partner running the Data Owner (DO) notebook, `DO_Journey.ipynb`
- **IMPORTANT**: Wait until the DO confirms they are ready before you start!

---
```
╔══════════════════════════════════════════════════════════════════╗
║  BEFORE YOU BEGIN                                               ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Make sure the Data Owner (DO) has:                              ║
║  1. Started their notebook                                       ║
║  2. Created their dataset                                        ║
║  3. Told you their email address                                 ║
║                                                                  ║
║  Ask DO: "What's your email? Are you ready?"                    ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
```

---
# ACT 1: Setup
## Scene 1.1: Install Dependencies

In [None]:
#@title Install syft-client { display-mode: "form" }
!pip install -q git+https://github.com/OpenMined/syft-client.git@beach-hands-on-demo

In [None]:
#@title Import syft-client { display-mode: "form" }
# Suppress noisy Google httplib2 warnings
import logging
logging.getLogger('googleapiclient.discovery_cache').setLevel(logging.ERROR)
logging.getLogger('google_auth_httplib2').setLevel(logging.ERROR)

import syft_client as sc
print(f"syft-client version: {sc.__version__}")

## Scene 1.2: Mount Google Drive

In [None]:
#@title Mount Google Drive { display-mode: "form" }
from google.colab import drive
drive.mount('/content/drive')

## Scene 1.3: Enter Your Email

In [None]:
# Your email address (Data Scientist)
DS_EMAIL = input("Enter your email address (Data Scientist): ").strip()
print(f"\nYou are: {DS_EMAIL}")
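
A mistyped address tends to surface only later, when login or the peer request fails. The sketch below is one way to catch obvious typos first. `looks_like_email` is a hypothetical helper, not part of syft-client, and the pattern is deliberately loose: it checks the shape of the string, not whether the mailbox exists.

```python
import re

# Hypothetical helper (not part of syft-client): a loose shape check only.
_EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(address: str) -> bool:
    """Return True if `address` has the rough shape local@domain.tld."""
    return _EMAIL_SHAPE.fullmatch(address.strip()) is not None


# Example addresses are placeholders.
print(looks_like_email("ds@example.com"))
print(looks_like_email("ds.example.com"))
```

You could call `looks_like_email(DS_EMAIL)` right after the prompt and ask again if it returns `False`.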

## Scene 1.4: Login as Data Scientist

Colab will prompt you to authenticate with Google. Allow access to Google Drive when asked; the sync steps later in this notebook go through Drive.

In [None]:
# Login as Data Scientist
# In Colab, authentication is handled automatically via Google's built-in auth
ds_client = sc.login_ds(email=DS_EMAIL)

print(f"\nLogged in as Data Scientist: {ds_client.email}")
print(f"   SyftBox folder: {ds_client.syftbox_folder}")

---
# ACT 2: Connect with Data Owner
## Scene 2.1: Get DO's Email

In [None]:
# Get the Data Owner's email
DO_EMAIL = input("Enter the Data Owner's email address: ").strip()
print(f"\nData Owner: {DO_EMAIL}")

## Scene 2.2: Add DO as Peer

In [None]:
# Add DO as peer
ds_client.add_peer(DO_EMAIL)

print(f"\nPeer request sent to {DO_EMAIL}!")
print("DO will receive an email notification.")

In [None]:
# Verify peer was added
ds_client.peers

---
```
╔══════════════════════════════════════════════════════════════════╗
║  INTERMISSION - WAITING FOR DATA OWNER                          ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Your peer request has been sent!                                ║
║                                                                  ║
║  Tell DO: "I've sent a peer request. Please accept it!"         ║
║                                                                  ║
║  The DO needs to:                                                ║
║  1. Run their 'Accept Peer' cell                                 ║
║  2. Enter your email to accept                                   ║
║                                                                  ║
║  You'll receive an EMAIL when DO accepts your request!          ║
║                                                                  ║
║  Wait for DO to accept, then continue to ACT 3...               ║
╚══════════════════════════════════════════════════════════════════╝
```

In [None]:
print("Tell DO: 'I've sent a peer request. Please accept it!'")
print("\nWaiting for DO to accept your peer request...")
print("   You'll receive an EMAIL when they accept!")
print("   Then continue to ACT 3.")
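
Instead of re-running cells by hand while you wait, you could poll. The sketch below is a generic wait loop with a timeout; `wait_until` is a hypothetical helper, not a syft-client function, and the commented usage assumes that datasets only become visible once the DO has accepted your request.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 300.0,
               interval_s: float = 10.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Hypothetical usage with the calls used elsewhere in this notebook:
# def do_has_shared_data():
#     ds_client.sync()
#     return bool(ds_client.datasets.get_all(datasite=DO_EMAIL))
# wait_until(do_has_shared_data)

# Stand-alone demo with a fake check that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until(lambda: next(attempts), timeout_s=5, interval_s=0.01))
```

A long interval keeps the number of sync calls low; the email notification is still the primary signal.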

---
# ACT 3: Explore Available Data
## Scene 3.1: Sync with DO's Datasite

In [None]:
# Sync to get latest data from DO
ds_client.sync()
print("Synced with Google Drive")

## Scene 3.2: Discover Datasets

In [None]:
# List all available datasets from DO
datasets = ds_client.datasets.get_all(datasite=DO_EMAIL)
datasets

## Scene 3.3: Explore a Dataset

In [None]:
# Get the first dataset
if datasets:
    dataset = datasets[0]
    dataset.describe()
else:
    print("No datasets found. Make sure the DO has created a dataset and accepted your peer request, then re-run Scene 3.1 to sync.")

In [None]:
# View dataset URLs
if datasets:
    print(f"Dataset: {dataset.name}")
    print(f"   Mock URL: {dataset.mock_url}")
    print(f"   Private URL: {dataset.private_url}")

## Scene 3.4: Preview Mock Data

The mock data mirrors the structure of the private dataset without revealing private information. Use it to learn the column names before you write your analysis.

In [None]:
import pandas as pd
from pathlib import Path

# Read the mock data to understand the structure
if datasets and dataset.mock_files:
    mock_file = dataset.mock_files[0]  # First mock file
    print(f"Reading mock data from: {mock_file}")
    
    df_mock = pd.read_csv(mock_file)
    print(f"\nMock Data Preview ({len(df_mock)} rows):")
    print(df_mock.head(10))
    print(f"\nColumns: {list(df_mock.columns)}")

---
# ACT 4: Submit Analysis Job
## Scene 4.1: Construct Private Data Path

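Use `syft://private/...` URLs to reference the DO's private data.
When the job runs on the DO's machine, `sc.resolve_path()` converts the URL to an actual file path.
The cell below guesses the private file name from the mock file name, so confirm the name with the DO if the guess looks wrong.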
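
To make the idea of URL resolution concrete, here is a purely illustrative sketch of how a `syft://private/...` URL could be mapped onto a folder on the DO's machine. This is not syft-client's implementation: `resolve_private_url` and the root folder are made up for the example, and the real `sc.resolve_path()` may work differently.

```python
from pathlib import Path, PurePosixPath

PRIVATE_PREFIX = "syft://private/"


def resolve_private_url(url: str, private_root: str) -> Path:
    """Illustrative only: map a syft://private/... URL onto a local folder."""
    if not url.startswith(PRIVATE_PREFIX):
        raise ValueError(f"not a private Syft URL: {url!r}")
    relative = PurePosixPath(url[len(PRIVATE_PREFIX):])
    # Refuse paths that would escape the private root.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"unsafe path in URL: {url!r}")
    return Path(private_root).joinpath(*relative.parts)


print(resolve_private_url(
    "syft://private/syft_datasets/sales-data/sales_private.csv",
    "/home/do/SyftBox/private",  # made-up root folder
))
```

The point of the indirection is that your code never contains a path on the DO's disk; it only names the dataset, and the DO's side decides where that lives.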

In [None]:
# Construct the private data URL
if datasets:
    private_url = str(dataset.private_url)
    
    # Get filename from mock files
    if dataset.mock_files_urls:
        mock_filename = Path(str(dataset.mock_files_urls[0])).name
        # Private files typically have same name or without 'mock' prefix
        private_filename = mock_filename.replace("_mock", "").replace("mock_", "")
        PRIVATE_DATA_PATH = f"{private_url}/{private_filename}"
    else:
        PRIVATE_DATA_PATH = f"{private_url}/sales_private.csv"
        
    print("Private data path to use in code:")
    print(f"   {PRIVATE_DATA_PATH}")
else:
    PRIVATE_DATA_PATH = "syft://private/syft_datasets/sales-data/sales_private.csv"
    print("Using default private data path:")
    print(f"   {PRIVATE_DATA_PATH}")
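
The file-name guess in the cell above is a plain substring replacement, so it is worth seeing what it returns for a few names. The sketch below isolates that heuristic in a hypothetical helper, `guess_private_filename`; the file names are made up for illustration.

```python
from pathlib import PurePosixPath


def guess_private_filename(mock_url: str) -> str:
    """Mirror the notebook's heuristic: strip `_mock` and `mock_` from the name."""
    name = PurePosixPath(mock_url).name
    return name.replace("_mock", "").replace("mock_", "")


# Made-up names, including one where the blunt replacement misfires.
for candidate in ("sales_mock.csv", "mock_sales.csv", "data/sales.csv", "hammock_sizes.csv"):
    print(candidate, "->", guess_private_filename(candidate))
```

Note that none of these guesses produces `sales_private.csv`, the name the fallback branch assumes. If the job later fails because the file is missing, ask the DO for the exact private file name and set `PRIVATE_DATA_PATH` by hand.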

## Scene 4.2: Write Analysis Code

In [None]:
# Write analysis code to a file
# This code will run on DO's machine with access to their private data

analysis_code = f'''
import os
import json
import syft_client as sc
import pandas as pd

# Access the private data using Syft URL
data_path = "{PRIVATE_DATA_PATH}"
resolved_path = sc.resolve_path(data_path)

print(f"Reading private data from: {{resolved_path}}")

# Load and analyze the data
df = pd.read_csv(resolved_path)

print(f"\\nDataset shape: {{df.shape}}")
print(f"Columns: {{list(df.columns)}}")

# Compute aggregate statistics
results = {{
    "total_rows": len(df),
    "columns": list(df.columns),
}}

# Compute stats for numeric columns
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
for col in numeric_cols:
    results[f"{{col}}_sum"] = float(df[col].sum())
    results[f"{{col}}_mean"] = float(df[col].mean())

# Calculate total revenue if applicable
if "quantity" in df.columns and "price_per_unit" in df.columns:
    total_revenue = (df["quantity"] * df["price_per_unit"]).sum()
    results["total_revenue"] = float(total_revenue)
    print(f"\\nTotal Revenue: ${{total_revenue:,.2f}}")

print("\\nResults:")
print(json.dumps(results, indent=2))

# Save results
os.makedirs("outputs", exist_ok=True)
with open("outputs/analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)

print("\\nResults saved to outputs/analysis_results.json")
'''

# Save to file
CODE_PATH = Path("/tmp/sales_analysis.py")
CODE_PATH.write_text(analysis_code)

print("Analysis code written to:", CODE_PATH)
print("\n" + "="*60)
print("CODE PREVIEW:")
print("="*60)
print(analysis_code)
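
Because the script is built with an f-string, every literal brace has to be doubled (`{{` and `}}`), and a missed one would only show up when the DO runs the job. A minimal sketch of a local check, assuming nothing beyond the standard library: parse the generated text with `ast.parse`, which checks syntax without executing anything. `is_valid_python` is a hypothetical helper name.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; nothing is executed."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


path = "syft://private/example.csv"  # placeholder value for the demo

# Single braces are filled in now; doubled braces survive as literal braces.
generated = f'''
data_path = "{path}"
results = {{"rows": 3}}
print(f"rows: {{results['rows']}}")
'''

print(generated)
print(is_valid_python(generated))
print(is_valid_python("results = {"))
```

In this notebook you could run `is_valid_python(analysis_code)` before submitting. It catches syntax errors only; a wrong column name or file path would still fail at run time.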

## Scene 4.3: Submit the Job

In [None]:
import uuid

# Generate a unique job name
JOB_NAME = f"sales-analysis-{uuid.uuid4().hex[:8]}"

# Submit the job to DO
ds_client.submit_python_job(
    user=DO_EMAIL,
    code_path=str(CODE_PATH),
    job_name=JOB_NAME,
)

print(f"\nJob '{JOB_NAME}' submitted to {DO_EMAIL}!")
print("DO will receive an email notification.")

In [None]:
# View your submitted jobs
ds_client.jobs

---
```
╔══════════════════════════════════════════════════════════════════╗
║  INTERMISSION - WAITING FOR JOB APPROVAL & EXECUTION            ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Your job has been submitted!                                    ║
║                                                                  ║
║  Tell DO: "I've submitted a job. Please review and run it!"     ║
║                                                                  ║
║  The DO needs to:                                                ║
║  1. Review your code                                             ║
║  2. Approve the job                                              ║
║  3. Execute the job                                              ║
║                                                                  ║
║  You'll receive EMAIL notifications for:                        ║
║  - Job approved                                                  ║
║  - Job completed                                                 ║
║                                                                  ║
║  Wait for DO to execute, then continue to ACT 5...              ║
╚══════════════════════════════════════════════════════════════════╝
```

In [None]:
print("Tell DO: 'I've submitted a job. Please review and run it!'")
print("\nWaiting for DO to approve and execute your job...")
print("   You'll receive EMAIL notifications!")
print("   Then continue to ACT 5.")

---
# ACT 5: Retrieve Results
## Scene 5.1: Sync to Get Results

In [None]:
# Sync to get the latest job status and results
ds_client.sync()
print("Synced with Google Drive")

## Scene 5.2: Check Job Status

In [None]:
# View all jobs
ds_client.jobs

## Scene 5.3: View Job Output

In [None]:
# Collect the jobs the DO has finished running
done_jobs = [j for j in ds_client.jobs if j.status == "done"]

if done_jobs:
    job = done_jobs[-1]  # Last completed job in the list (assumed to be the most recent)
    print(f"Job: {job.name}")
    print(f"   Status: {job.status}")
    print(f"\nSTDOUT:")
    print(job.stdout)
else:
    print("No completed jobs yet. Wait for DO to execute your job.")
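
Taking the last completed job works when you have submitted only one. If you have several, looking the job up by the name you generated in Scene 4.3 is less ambiguous. The sketch below uses stand-in objects so it runs on its own: `FakeJob` and `find_job` are hypothetical, only the `name` and `status` attributes are taken from this notebook, and the `"pending"` status is a made-up example value.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeJob:
    """Stand-in for a job object; only `name` and `status` are assumed."""
    name: str
    status: str


def find_job(jobs: Iterable, name: str, status: str = "done"):
    """Return the first job with the given name and status, or None."""
    for job in jobs:
        if job.name == name and job.status == status:
            return job
    return None


jobs = [
    FakeJob("sales-analysis-aaaa1111", "done"),
    FakeJob("sales-analysis-bbbb2222", "pending"),  # made-up status value
]
print(find_job(jobs, "sales-analysis-aaaa1111"))
print(find_job(jobs, "sales-analysis-bbbb2222"))
```

With the real client this would read `find_job(ds_client.jobs, JOB_NAME)`, assuming `JOB_NAME` is still defined in your session.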

## Scene 5.4: Access Result Files

In [None]:
import json

if done_jobs:
    job = done_jobs[-1]
    print(f"Output files: {job.output_paths}")
    
    # Read the results file
    for output_path in job.output_paths:
        if str(output_path).endswith(".json"):
            print(f"\nReading results from: {output_path}")
            with open(output_path, "r") as f:
                results = json.load(f)
            
            print("\n" + "="*60)
            print("ANALYSIS RESULTS:")
            print("="*60)
            print(json.dumps(results, indent=2))
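
Raw JSON gets hard to scan once there are many numeric columns. As an optional extra, the sketch below renders the numeric entries of a results dictionary as an aligned text table. `format_results` is a hypothetical helper and the sample values are invented; they are not output from a real job.

```python
def format_results(results: dict) -> str:
    """Render the numeric entries of `results` as an aligned two-column table."""
    numeric = {
        key: value
        for key, value in results.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    if not numeric:
        return "(no numeric results)"
    width = max(len(key) for key in numeric)
    return "\n".join(
        f"{key.ljust(width)}  {value:>14,.2f}" for key, value in numeric.items()
    )


# Invented sample values, shaped like the dictionary the analysis script saves.
sample = {
    "total_rows": 3,
    "columns": ["quantity", "price_per_unit"],
    "quantity_sum": 12.0,
    "total_revenue": 1234.5,
}
print(format_results(sample))
```

Non-numeric entries such as the `columns` list are skipped, so you can pass the loaded `results` dictionary in directly.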

---
# ACT 6: Finale

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║  CONGRATULATIONS! THE QUEST IS COMPLETE!                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  As a Data Scientist, you successfully:                          ║
║                                                                  ║
║  - Set up credentials and authenticated                         ║
║  - Connected with a Data Owner                                  ║
║  - Discovered and explored available datasets                   ║
║  - Understood data structure using mock data                    ║
║  - Wrote analysis code using Syft URLs                          ║
║  - Submitted a job for remote execution                         ║
║  - Received email notifications for job status                  ║
║  - Retrieved computed results                                   ║
║                                                                  ║
║  You NEVER had direct access to the private data!               ║
║  Your code ran on DO's machine, and only results were returned. ║
║                                                                  ║
║  This is privacy-preserving data science in action!             ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
""")

---
# Appendix: Additional Operations

In [None]:
# View all your peers
# ds_client.peers

In [None]:
# List datasets from ALL peers
# ds_client.datasets.get_all()

In [None]:
# View job history
# for job in ds_client.jobs:
#     print(f"{job.status}: {job.name}")