<h3 style="color:#fbbc05;">2.1 Collecting Data: The First Step in Any ML Project</h3>

Before you can build a model, you need data. In industry, data comes from three primary sources, each with its own considerations, advantages, and challenges.

<h4 style="color:#1a73e8;">2.1.1 Open-Source Datasets</h4>

Open-source datasets are invaluable for learning, prototyping, and benchmarking. Most are free, many are well documented, and the popular ones often come with example code.

**Major Sources**:

1. **Kaggle Datasets** (`kaggle.com/datasets`)
   - **What it is**: A large data science community that hosts a very large catalog of public datasets
   - **Types**: Everything from Titanic passenger lists to satellite imagery to financial data
   - **Advantages**: 
     - Datasets are often cleaned and documented
     - Community discussions and notebooks (formerly called kernels) with example code
     - Competitions provide real-world problems
   - **Industry Example**: A fintech startup prototypes a credit risk model using the "Give Me Some Credit" dataset before accessing their proprietary user data. This allows them to test algorithms and validate approaches without risking sensitive customer information.

2. **UCI Machine Learning Repository** (`archive.ics.uci.edu/ml`)
   - **What it is**: A long-standing academic repository maintained by UC Irvine
   - **Types**: Classic datasets used in research papers (Iris, Wine, Adult Income, etc.)
   - **Advantages**: 
     - Well-documented with metadata
     - Used in research, so results are comparable
     - Clean, structured format
   - **Industry Example**: Researchers at a hospital use the UCI "Heart Disease" dataset to validate a new diagnostic feature before clinical trials. This helps them understand data requirements and expected performance.

3. **Government & Public Data**
   - **USA**: `data.gov` (housing, climate, economic indicators, transportation)
   - **EU**: `data.europa.eu` (European Union open data)
   - **UK**: `data.gov.uk` (UK government data)
   - **World Bank**: `data.worldbank.org` (global economic and social data)
   - **Advantages**: 
     - Real-world, often large-scale data
     - Updated regularly
     - Free and usually published under an open license (check each dataset's license before use)
   - **Industry Example**: A logistics company uses U.S. Department of Transportation data to predict highway congestion. The predictions help them optimize delivery routes and reduce fuel costs.

**Downloading from Kaggle**:

In [None]:
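# Install Kaggle API: pip install kaggle
# Get API credentials from: https://www.kaggle.com/account
# Save the downloaded kaggle.json to ~/.kaggle/ (or set KAGGLE_USERNAME and KAGGLE_KEY)

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # Reads the credentials configured above

# Download a dataset; replace the placeholder with the real 'owner/dataset-name' slug
api.dataset_download_files('owner/dataset-name', path='./data', unzip=True)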
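
After a download finishes, it helps to take a quick inventory of what actually landed on disk. The following is a minimal sketch using only the standard library; the helper name `summarize_csvs` and the demo file `train.csv` are hypothetical and not part of the Kaggle API.

```python
import csv
import tempfile
from pathlib import Path


def summarize_csvs(data_dir):
    """Return {filename: header columns} for each CSV file directly inside data_dir."""
    summary = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            # Read only the first row; an empty file yields an empty header
            header = next(csv.reader(f), [])
        summary[path.name] = header
    return summary


# Demo on a throwaway directory so the example runs without any download
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("id,age,income\n1,34,52000\n", encoding="utf-8")
    demo_summary = summarize_csvs(tmp)

print(demo_summary)
```

In a real project you would point `summarize_csvs` at the `./data` folder used in the download call.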

<h4 style="color:#1a73e8;">2.1.2 APIs (Application Programming Interfaces)</h4>

APIs provide structured, programmatic access to live or regularly updated data. They're essential for production ML systems that need real-time or frequently refreshed data.

**What is an API?** An API is a way for different software systems to communicate. In data collection, APIs let you request data from a server and receive it in a structured format (usually JSON).
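
To make "structured format" concrete, here is a minimal sketch of turning a JSON response body into a flat record suitable for a table row. The sample payload is hand-written for illustration and only mimics the general shape of a weather API response; the helper name `flatten_weather` is hypothetical.

```python
import json

# Hand-written sample payload (an assumption for illustration, not real API output)
SAMPLE_BODY = """
{
  "name": "London",
  "main": {"temp": 12.5, "humidity": 81},
  "weather": [{"description": "light rain"}]
}
"""


def flatten_weather(body: str) -> dict:
    """Parse a JSON string and pull the nested fields into one flat record."""
    payload = json.loads(body)
    return {
        "city": payload["name"],
        "temperature": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "description": payload["weather"][0]["description"],
    }


record = flatten_weather(SAMPLE_BODY)
print(record)
```

Flat records like this one can be collected into a list and loaded into a DataFrame in a single step.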

**Common API Use Cases**:
- **Weather data**: For agricultural yield prediction, energy demand forecasting
- **Financial data**: Stock prices, exchange rates, economic indicators
- **Social media**: Twitter, Reddit (with rate limits and terms of service)
- **E-commerce**: Product prices, reviews, inventory levels
- **Government**: Census data, employment statistics

**Example: Fetching Weather Data**:

In [None]:
import os
from datetime import datetime

import pandas as pd
import requests

# Get your free API key from https://openweathermap.org/api
# Store it as an environment variable (never hardcode!)
API_KEY = os.getenv('OPENWEATHER_API_KEY')  # Set in your system or .env file
CITY = "London"

# Construct the API URL (HTTPS so the key is not sent in plain text)
url = f"https://api.openweathermap.org/data/2.5/weather?q={CITY}&appid={API_KEY}&units=metric"

try:
    # Make the request
    response = requests.get(url, timeout=10)  # 10 second timeout

    # Check if request was successful
    if response.status_code == 200:
        weather_data = response.json()

        # Extract relevant information
        temperature = weather_data['main']['temp']
        humidity = weather_data['main']['humidity']
        description = weather_data['weather'][0]['description']

        print(f"Current weather in {CITY}:")
        print(f"  Temperature: {temperature}°C")
        print(f"  Humidity: {humidity}%")
        print(f"  Conditions: {description}")

        # Save to DataFrame for ML use
        df = pd.DataFrame([{
            'city': CITY,
            'timestamp': datetime.now(),
            'temperature': temperature,
            'humidity': humidity,
            'description': description
        }])

    else:
        print(f"Error: {response.status_code}")
        print(f"Message: {response.text}")

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

**Best Practices for API Usage**:
1. **Rate Limiting**: Respect API rate limits. Add delays between requests:

In [None]:
import time
time.sleep(1)  # Wait 1 second between requests
 

2. **Error Handling**: Always handle errors gracefully: 

In [None]:
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises exception for bad status codes
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")

3. **Authentication**: Never hardcode API keys. Use environment variables:

In [None]:
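# Create a .env file (add to .gitignore!)
# OPENWEATHER_API_KEY=your_key_here
# Requires: pip install python-dotenv

import os

from dotenv import load_dotenv

load_dotenv()  # Loads variables from .env into the process environment
API_KEY = os.getenv('OPENWEATHER_API_KEY')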

4. **Caching**: Cache API responses to avoid unnecessary requests:

In [None]:
import hashlib
import os
import pickle

import requests

def get_cached_data(url, cache_dir='cache'):
    # Create hash of URL as cache key
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = os.path.join(cache_dir, f"{cache_key}.pkl")

    # Cache hit: return the stored response without calling the API
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    # Cache miss: fetch the data, store it, and return it
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data
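
The rate-limiting and error-handling practices above can be combined into one small loop. The sketch below is one possible arrangement, not a standard recipe: `fetch_all`, its parameters, and the fake `flaky_fetch` function are all hypothetical names introduced for illustration. The fetch and sleep functions are passed in so the demo runs without a network connection or real waiting.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval: float = 1.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch(url) for each URL, pausing between calls and retrying failures."""
    results = []
    for url in urls:
        for attempt in range(max_retries + 1):
            try:
                results.append(fetch(url))
                break
            except Exception:
                # Broad catch keeps the sketch generic; with requests you would
                # catch requests.exceptions.RequestException instead.
                if attempt == max_retries:
                    raise
                # Exponential backoff: min_interval, then 2x, 4x, ...
                sleep(min_interval * 2 ** attempt)
        # Fixed pause between successive URLs to stay under the rate limit
        sleep(min_interval)
    return results


# Demo with a fake fetch function that fails once, so no network is needed
calls = {"n": 0}


def flaky_fetch(url: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated transient failure")
    return f"data from {url}"


waits = []  # Records requested sleep durations instead of actually sleeping
out = fetch_all(["a", "b"], flaky_fetch, min_interval=1.0, sleep=waits.append)
print(out)
print(waits)
```

With a real API, `fetch` could be a small wrapper around `requests.get(...).json()`, and `sleep` would be left at its default.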

> **Security Note**: Never commit API keys to version control. Use environment variables or secure key management services (AWS Secrets Manager, Azure Key Vault) in production.
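
One way to make missing credentials fail loudly, rather than silently sending `None` as the key, is a small lookup helper. This is a minimal sketch; the name `require_env` and the variable `DEMO_API_KEY` are hypothetical and exist only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "define it in your shell or .env file."
        )
    return value


# Demo with a throwaway variable so the example runs anywhere
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

Calling the helper once at startup surfaces a configuration problem before any request is made.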

<h4 style="color:#1a73e8;">2.1.3 Web Scraping (Use with Extreme Caution)</h4>

Web scraping involves programmatically extracting data from websites. This is a **legal and ethical gray area** that requires careful consideration.

**When Scraping is Acceptable**:
- The website's `robots.txt` file permits it (check `website.com/robots.txt`)
- The data is publicly available and not behind authentication
- You're not overloading their servers (add delays between requests)
- The data is not protected by copyright or terms of service
- You're using the data for research or personal learning (not commercial use without permission)
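
The `robots.txt` check in the list above can be automated with Python's standard `urllib.robotparser` module. The sketch below parses a hand-written set of rules (an assumption for illustration, not any real site's policy) so it runs offline; the helper name `is_allowed` and the user agent string are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hand-written robots.txt rules (an assumption for illustration)
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
""".strip().splitlines()


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS, "EducationalBot", "https://example.com/products")
private_ok = is_allowed(SAMPLE_ROBOTS, "EducationalBot", "https://example.com/private/data")
print(public_ok, private_ok)
```

Against a live site, you would call `parser.set_url(...)` with the site's `robots.txt` address followed by `parser.read()` instead of `parse`. Passing this check is necessary but not sufficient: the Terms of Service still apply.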

**When to Avoid Scraping**:
- The website explicitly prohibits it in their Terms of Service
- The data requires authentication (login)
- You're scraping at a rate that could harm the website
- The data is copyrighted or proprietary
- You plan to use it commercially without permission

**Basic Web Scraping Example** (Educational Only):

In [None]:
import time

import requests
from bs4 import BeautifulSoup

# Always check robots.txt first!
# Example: https://example.com/robots.txt

def scrape_with_respect(url, delay=2):
    """
    Scrape a webpage with respect for the server.
    
    Parameters:
    - url: The webpage to scrape
    - delay: Seconds to wait between requests (be respectful!)
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Educational Bot)'  # Identify yourself
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract data (example: finding all links)
        links = []
        for link in soup.find_all('a', href=True):
            links.append(link['href'])
        
        # Be respectful: wait before next request
        time.sleep(delay)
        
        return links
        
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return []

# Example usage (only for educational purposes!)
# data = scrape_with_respect('https://example.com')

**Industry Context**: An e-commerce company monitors competitor pricing by scraping public product pages. This is a common practice, though legally sensitive. Many companies rely on specialized price-monitoring services (such as Price2Spy) rather than building and operating scrapers themselves.

> **Golden Rule**: **Always prioritize data provenance and ethics**. Biased or illegally obtained data can lead to model failure, reputational damage, or legal liability. When in doubt, use official APIs or purchase data from legitimate providers.

---