# Module 06: Networking Basics for Data Science

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 60 minutes

**Prerequisites**: 
- Completed Modules 00-05
- Basic understanding of HTTP and APIs
- Familiarity with the `requests` library (the setup cell installs it if it is missing)

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Check** network connectivity and diagnose issues
2. **Test** port availability and service accessibility
3. **Work with** HTTP APIs for data collection
4. **Download** datasets and files reliably
5. **Handle** proxy settings and authentication
6. **Monitor** network usage for data transfers

## Introduction: Why Networking for Data Scientists?

Data scientists constantly work with networked resources:

### Common Scenarios

**1. Data Collection**
- Fetch data from REST APIs
- Download datasets from cloud storage
- Scrape data from websites (ethically)
- Access databases over network

**2. Model Deployment**
- Serve predictions via HTTP API
- Check if port is available
- Test API endpoint accessibility
- Monitor service health

**3. Distributed Computing**
- Connect to remote Jupyter servers
- Access cloud GPU resources
- Transfer large datasets
- Check network bandwidth

**4. Troubleshooting**
- Debug connection failures
- Test if API is reachable
- Identify network bottlenecks
- Verify firewall rules

In [None]:
# Setup: Import required libraries
import socket
import subprocess
import sys
from pathlib import Path
import time
from urllib.parse import urlparse

# Install requests if needed
try:
    import requests
except ImportError:
    # check=True raises an error if pip fails, instead of failing later on import
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'requests', '-q'], check=True)
    import requests

print(f"requests version: {requests.__version__}")
print("Setup complete!")

## 1. Network Connectivity Testing

In [None]:
# Test network connectivity to a host
def check_connectivity(host, port=443, timeout=3):
    """
    Check if a host is reachable by opening a TCP connection.
    
    Args:
        host: Hostname or IP address
        port: TCP port to connect to (default: 443, HTTPS)
        timeout: Timeout in seconds
    
    Returns:
        tuple: (reachable: bool, latency_ms: float or None)
    """
    try:
        start = time.perf_counter()
        # Resolve the hostname and open a TCP connection; the measured
        # latency therefore includes the DNS lookup
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.perf_counter() - start) * 1000
        return True, latency_ms
    except OSError:
        # Covers DNS failures (socket.gaierror), timeouts, and refused connections
        return False, None

# Test common hosts
hosts = ['google.com', 'github.com', 'pypi.org']
print("Testing connectivity:")
for host in hosts:
    reachable, latency = check_connectivity(host)
    if reachable:
        print(f"✓ {host}: {latency:.1f}ms")
    else:
        print(f"✗ {host}: Not reachable")

### 1.1 Port Testing

In [None]:
# Check if a specific port is open
def check_port(host, port, timeout=3):
    """
    Check if a port is open on a host.
    
    Args:
        host: Hostname or IP
        port: Port number
        timeout: Timeout in seconds
    
    Returns:
        bool: True if port is open
    """
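    try:
        # The context manager closes the socket even if an error occurs
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex() returns 0 on success and an error code otherwise
            return sock.connect_ex((host, port)) == 0
    except OSError:
        # Raised, for example, when the hostname cannot be resolved
        return False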

# Test common ports
print("\nTesting ports on localhost:")
common_ports = {
    80: 'HTTP',
    443: 'HTTPS',
    8888: 'Jupyter',
    5432: 'PostgreSQL',
    27017: 'MongoDB'
}

for port, service in common_ports.items():
    is_open = check_port('localhost', port)
    status = "✓ Open" if is_open else "✗ Closed or unreachable"
    print(f"Port {port} ({service}): {status}")

## 2. HTTP Requests for Data Collection

In [None]:
# Fetch data from API with error handling
def fetch_api_data(url, params=None, timeout=10):
    """
    Safely fetch data from an API.
    
    Args:
        url: API endpoint URL
        params: Query parameters dict
        timeout: Request timeout
    
    Returns:
        tuple: (success: bool, data: parsed JSON or str, or None on failure,
                error: str, or None on success)
    """
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # Raise exception for 4xx/5xx
        
        # Try to parse as JSON; fall back to raw text if the body is not JSON
        try:
            data = response.json()
        except ValueError:
            data = response.text
        
        return True, data, None
    
    except requests.exceptions.Timeout:
        return False, None, f"Request timed out after {timeout}s"
    except requests.exceptions.ConnectionError:
        return False, None, "Connection failed"
    except requests.exceptions.HTTPError as e:
        return False, None, f"HTTP error: {e}"
    except Exception as e:
        return False, None, f"Error: {e}"

# Example: Fetch public API data
success, data, error = fetch_api_data('https://api.github.com/zen')
if success:
    print(f"✓ API Response: {data}")
else:
    print(f"✗ Failed: {error}")

### 2.1 Downloading Files Reliably

In [None]:
# Download files in chunks with progress tracking (resume support is left to Exercise 3)
def download_file(url, dest_path, chunk_size=8192):
    """
    Download file with progress tracking.
    
    Args:
        url: URL to download from
        dest_path: Destination file path
        chunk_size: Download chunk size in bytes
    
    Returns:
        bool: Success status
    """
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    
    try:
        # stream=True avoids loading the whole file into memory
        with requests.get(url, stream=True, timeout=30) as response:
            response.raise_for_status()
            
            # Content-Length may be missing (e.g. chunked transfer encoding)
            total_size = int(response.headers.get('content-length', 0))
            
            print(f"Downloading: {dest_path.name}")
            if total_size > 0:
                print(f"Size: {total_size / 1024 / 1024:.2f} MB")
            else:
                print("Size: unknown")
            
            downloaded = 0
            with open(dest_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    if chunk:
                        f.write(chunk)
                        downloaded += len(chunk)
                        
                        if total_size > 0:
                            percent = (downloaded / total_size) * 100
                            print(f"\rProgress: {percent:.1f}%", end='')
        
        print("\n✓ Download complete!")
        return True
    
    except (requests.exceptions.RequestException, OSError) as e:
        # Network errors from requests, or file system errors while writing
        print(f"\n✗ Download failed: {e}")
        return False

# Example (small file)
# download_file('https://example.com/file.csv', 'data/downloaded.csv')
print("download_file() function ready!")

## 3. Practice Exercises

### Exercise 1: API Data Collector

Create a tool to collect data from multiple APIs:
1. Accept a list of API URLs
2. Fetch data from each URL, retrying on failure
3. Save responses to JSON files
4. Log success/failure rates

**Hint**: Use `fetch_api_data()` with retry logic
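
One way to approach the retry logic in the hint is a small wrapper with exponential backoff. The sketch below is only an illustration: `with_retries`, `base_delay`, and the `sleep` parameter are hypothetical names, not part of this module. It assumes the wrapped callable returns the same `(success, data, error)` tuple that `fetch_api_data()` returns.

```python
import time

def with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it reports success or the attempts run out.

    func must return a (success, data, error) tuple (assumed shape,
    matching fetch_api_data()). The delay doubles after each failure.
    The sleep parameter exists only so the wait can be stubbed in tests.
    """
    error = None
    for attempt in range(1, max_retries + 1):
        success, data, error = func()
        if success:
            return True, data, None
        if attempt < max_retries:
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    # All attempts failed: report the last error seen
    return False, None, error
```

With the function from Section 2, a call could look like `with_retries(lambda: fetch_api_data(url), max_retries=3)`. Saving the responses and logging success rates are still left to the exercise.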

In [None]:
# Exercise 1: Your solution here
import json

def collect_api_data(urls, output_dir, max_retries=3):
    """
    Collect data from multiple APIs.
    
    Args:
        urls: List of API URLs
        output_dir: Directory to save responses
        max_retries: Maximum retry attempts
    """
    # TODO: Implement API data collector
    pass

# Test your collector
# urls = ['https://api.example.com/data1', 'https://api.example.com/data2']
# collect_api_data(urls, 'data/api_responses')


### Exercise 2: Service Health Checker

Create a service health monitoring tool:
1. Check if services are running (port check)
2. Test HTTP endpoint response time
3. Verify response status codes
4. Alert if service is down

**Hint**: Combine port checking and HTTP requests
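
The port-check half of the hint needs a host and port, while the services are given as URLs. A minimal sketch of that step, using `urlparse` from the setup cell, might look like this. The names `port_from_url`, `tcp_check`, and `DEFAULT_PORTS` are hypothetical helpers for illustration, and only `http` and `https` default ports are assumed.

```python
import socket
import time
from urllib.parse import urlparse

# Assumed defaults when the URL does not name a port explicitly
DEFAULT_PORTS = {"http": 80, "https": 443}

def port_from_url(url):
    """Return (host, port) for a URL, falling back to the scheme's default port."""
    parsed = urlparse(url)
    port = parsed.port or DEFAULT_PORTS.get(parsed.scheme)
    if parsed.hostname is None or port is None:
        raise ValueError(f"Cannot determine host/port from {url!r}")
    return parsed.hostname, port

def tcp_check(url, timeout=3):
    """Try a TCP connection to the URL's host and port.

    Returns (is_open, elapsed_ms); elapsed_ms is None when the connection fails.
    """
    host, port = port_from_url(url)
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.perf_counter() - start) * 1000
    except OSError:
        # DNS failure, timeout, or refused connection
        return False, None
```

A checker could call `tcp_check(url)` first and only send the HTTP request (and compare the status code with `expected_status`) when the port is open, which helps separate "service is not listening" from "service responds with an error".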

In [None]:
# Exercise 2: Your solution here

class ServiceHealthChecker:
    """
    Monitor health of network services.
    """
    
    def __init__(self, services):
        """
        Args:
            services: List of (name, url, expected_status) tuples
        """
        # TODO: Initialize checker
        pass
    
    def check_all(self):
        # TODO: Check all services
        pass

# Test your checker
# services = [('API', 'http://localhost:8000/health', 200)]
# checker = ServiceHealthChecker(services)
# checker.check_all()


### Exercise 3: Dataset Downloader

Create a robust dataset downloader:
1. Download from a list of URLs
2. Resume interrupted downloads
3. Verify file integrity (checksums)
4. Extract archives automatically

**Hint**: Use `download_file()` and add resume support
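
Two building blocks for steps 2 and 3 can be sketched with the standard library alone. The helper names `sha256_of_file` and `resume_headers` are hypothetical, and SHA-256 is an assumed choice of checksum; use whichever algorithm the dataset publisher provides. Resuming relies on the standard HTTP `Range` request header, which a server is free to ignore.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA-256 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns empty bytes (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def resume_headers(dest_path):
    """Build request headers asking only for the bytes not yet on disk.

    Returns (existing_bytes, headers). headers is empty when there is
    nothing to resume from.
    """
    dest_path = Path(dest_path)
    existing = dest_path.stat().st_size if dest_path.exists() else 0
    headers = {"Range": f"bytes={existing}-"} if existing else {}
    return existing, headers
```

When passing these headers to `requests.get(..., headers=headers, stream=True)`, check the status code before appending: `206` means the server sent only the missing part, so the file can be opened in `'ab'` mode, while `200` means the full file is being sent again and the download should start over in `'wb'` mode.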

In [None]:
# Exercise 3: Your solution here

def download_datasets(dataset_urls, output_dir):
    """
    Download multiple datasets with resume support.
    
    Args:
        dataset_urls: Dict of {name: url}
        output_dir: Output directory
    """
    # TODO: Implement dataset downloader
    pass

# Test your downloader
# datasets = {'sample': 'https://example.com/dataset.zip'}
# download_datasets(datasets, 'data/downloads')


## 4. Summary

### Key Concepts

1. **Connectivity Testing**
   - Use `socket` for host/port checking
   - Test before making requests
   - Measure latency

2. **HTTP Requests**
   - Use `requests` library
   - Handle timeouts and errors
   - Parse JSON responses

3. **File Downloads**
   - Stream large files
   - Track progress
   - Handle interruptions

### What's Next?

In **Module 07: Scheduled Tasks & Automation**, you'll learn:
- Windows Task Scheduler
- Automated data collection
- Cron-like scheduling
- Email notifications

---

**Continue to Module 07** when ready!