## 1. DATA COLLECTION CHALLENGES

### Definition
**Data collection** is the foundation of every ML project. Poorly collected data produces poor models, no matter how sophisticated the algorithms are.

### Challenge Overview
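Getting the right data is often said to account for around 80% of the work in an ML project (a rule of thumb, not a measured figure), yet it is routinely overlooked in favor of model building.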

```
Good Model + Bad Data = Bad Predictions ❌
Bad Model + Good Data = Can be improved ✅
```

### 1.1 Web Scraping for Data Collection

#### What is Web Scraping?
Web scraping means extracting data from websites programmatically instead of copying it by hand.

#### Challenges with Web Scraping:

**1. Legal & Ethical Issues**
- Many websites forbid scraping in their Terms of Service (ToS)
- Extracted content may be protected by copyright
- Collecting personal data brings privacy laws such as GDPR and CCPA into play
- Website owners may take legal action


In [None]:
# Example: What NOT to do (scraping without permission)
import requests
from bs4 import BeautifulSoup

# ❌ Scraping without checking robots.txt
url = "https://example.com/data"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
data = soup.find_all('div', class_='data')
# Might violate ToS!
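
A safer first step is to check the site's robots.txt before fetching anything. The sketch below uses the standard library's `urllib.robotparser`. The helper name `is_allowed`, the bot name `my-research-bot`, and the sample robots.txt content are assumptions made for illustration. Passing a robots.txt check does not replace reading the Terms of Service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://example.com/data"))
print(is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://example.com/private/x"))
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would load the real file instead of the sample string.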


**2. Anti-Scraping Protection**
- Websites block scrapers with CAPTCHA
- IP blocking after multiple requests
- JavaScript rendering required (static HTML won't work)
- Rate limiting and throttling


In [None]:
# Example: Handling anti-scraping measures
import time

import requests
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

# ❌ Simple approach: rapid-fire requests are likely to get blocked
for i in range(1000):
    page = requests.get(f'https://example.com/page/{i}')
    # Many sites block an IP after a burst of requests like this

# ✅ Better approach: real browser + delay between requests
# (proxy rotation is a separate step and is not shown here)
driver = Chrome()
try:
    for i in range(100):
        driver.get(f'https://example.com/page/{i}')
        time.sleep(2)  # Respectful delay
        data = driver.find_elements(By.CLASS_NAME, 'data')
        # Process data here, while the browser session is still open
finally:
    driver.quit()  # Always close the browser, even if a page fails


**3. Dynamic Content**
- Websites load data with JavaScript
- Content not in initial HTML
- Requires browser automation (slow, resource-intensive)


In [None]:
# Example: Dynamic content challenges
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ❌ Won't work: BeautifulSoup only sees the initial HTML
response = requests.get('https://spa-website.com')
soup = BeautifulSoup(response.content, 'html.parser')
data = soup.find_all('div', class_='dynamic-content')
# Empty! Content is loaded later by JavaScript

# ✅ Solution: Use Selenium
driver = webdriver.Chrome()
try:
    driver.get('https://spa-website.com')

    # Wait up to 10 seconds for JavaScript to render the elements
    data = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "dynamic-content"))
    )
    print(f"Found {len(data)} items")
finally:
    driver.quit()  # Always close the browser, even if the wait times out
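
Because browser automation is slow, it can be worth checking whether the target elements already appear in the static HTML before reaching for Selenium. The following is a minimal sketch using only the standard library's `html.parser`. The names `ClassCounter` and `count_in_static_html` and both HTML snippets are hypothetical.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_in_static_html(html, class_name):
    """Return how many elements with class_name exist in the raw HTML."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical pages: one server-rendered, one a JavaScript app shell
STATIC_PAGE = '<div class="item">A</div><div class="item big">B</div>'
SPA_SHELL = '<div id="root"></div><script src="app.js"></script>'

print(count_in_static_html(STATIC_PAGE, "item"))
print(count_in_static_html(SPA_SHELL, "item"))
```

A count of zero on the raw response suggests the content is rendered client-side, in which case a browser-based tool is likely needed.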


#### Best Practices for Web Scraping:
1. **Check robots.txt:** `https://example.com/robots.txt`
2. **Respect ToS:** Read Terms of Service
3. **Add delays:** Use `time.sleep()` between requests
4. **Identify yourself:** Set proper User-Agent headers
5. **Use APIs first:** If available (easier, legal, faster)
6. **Rotate IPs:** Use proxy services for large-scale scraping
7. **Monitor rate:** Don't overload servers
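
Practices 3 and 4 can be combined into a small helper. This is a sketch under stated assumptions, not a definitive implementation: the `Throttle` class, the `HEADERS` value, and the contact address are all hypothetical. The clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Hypothetical identification string; replace with your own project and contact
HEADERS = {"User-Agent": "my-research-bot/0.1 (contact: you@example.com)"}
```

In a scraping loop, calling `throttle.wait()` immediately before each `requests.get(url, headers=HEADERS, timeout=10)` keeps the request rate bounded regardless of how long processing takes.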

### 1.2 API-Based Data Collection

#### What are APIs?
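An API (Application Programming Interface) is a structured, provider-supported way to request data from a service. When one exists, it is usually a better choice than scraping.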

#### Advantages:
- ✅ Authorized access under the provider's terms
- ✅ Structured data format (JSON, XML)
- ✅ Real-time updates
- ✅ Rate limits are documented and predictable
- ✅ Documentation available

#### Challenges:

**1. Rate Limiting**


In [None]:
# Example: Hitting rate limits
import random
import time

import requests

API_URL = "https://api.example.com/data"
API_KEY = "your-api-key"
HEADERS = {'Authorization': f'Bearer {API_KEY}'}

# ❌ Too fast: Rate limit exceeded
for i in range(1000):
    response = requests.get(API_URL, params={'id': i}, headers=HEADERS)
    # Error: 429 Too Many Requests

# ✅ Respectful approach: delays, plus a bounded retry of the same item
MAX_RETRIES = 5

for i in range(1000):
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(
                API_URL, params={'id': i}, headers=HEADERS, timeout=10
            )
        except requests.exceptions.Timeout:
            print("Request timeout, retrying...")
            time.sleep(5)
            continue  # Retry the same id

        if response.status_code == 429:  # Rate limited
            # Retry-After may also be an HTTP date; fall back to 60s in that case
            retry_after = response.headers.get('Retry-After', '60')
            wait_time = int(retry_after) if retry_after.isdigit() else 60
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
            continue  # Retry the same id

        if response.status_code == 200:
            data = response.json()
            # Process data
            time.sleep(random.uniform(1, 3))  # Respectful delay
        else:
            print(f"Error: {response.status_code}")
        break  # Done with this id (success or non-retryable error)
    else:
        print(f"Giving up on id {i} after {MAX_RETRIES} attempts")
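
When the server sends no `Retry-After` header, a common fallback is exponential backoff with jitter: the maximum wait doubles with each failed attempt, and a random fraction of it is used so that many clients do not retry in lockstep. The sketch below shows one variant (often called "full jitter"). The function names and the default `base` and `cap` values are assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def retry_after_seconds(header_value, default=60.0):
    """Parse a Retry-After header given in seconds; use default otherwise.

    The header may also carry an HTTP date, which this sketch does not parse.
    """
    if header_value is not None and header_value.strip().isdigit():
        return float(header_value)
    return default


# Upper bound of the wait for the first few attempts
for attempt in range(5):
    print(attempt, min(60.0, 1.0 * 2 ** attempt))
```

In a retry loop, `time.sleep(backoff_delay(attempt))` would replace a fixed sleep whenever `retry_after_seconds` has nothing better to offer.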


**2. Quota Limits**
- API calls limited per month/day
- Paid plans for more requests
- Sometimes insufficient for training data needs


In [None]:
# Example: Tracking API quota
class APIClient:
    def __init__(self, api_key, monthly_quota=1000):
        self.api_key = api_key
        self.monthly_quota = monthly_quota
        self.calls_used = 0
    
    def get_data(self, endpoint):
        if self.calls_used >= self.monthly_quota:
            print(f"❌ Quota exceeded! {self.calls_used}/{self.monthly_quota}")
            return None
        
        response = requests.get(
            endpoint,
            headers={'Authorization': f'Bearer {self.api_key}'}
        )
        
        self.calls_used += 1
        return response.json()

# api_key is a required argument; omitting it raises TypeError
client = APIClient(api_key="your-api-key", monthly_quota=5000)

for i in range(10000):
    data = client.get_data(f'https://api.example.com/item/{i}')
    if data is None:
        print("Can't collect more data - quota exceeded!")
        break
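
Before committing to an API as a training-data source, it helps to estimate how long collection will take under its quota. The arithmetic is simple enough to sketch. The function name `days_to_collect` and the example numbers are hypothetical.

```python
import math


def days_to_collect(target_records, records_per_call, calls_per_day):
    """Estimate the days needed to gather target_records under a daily call quota."""
    calls_needed = math.ceil(target_records / records_per_call)
    return math.ceil(calls_needed / calls_per_day)


# Hypothetical plan: 1,000,000 records, 100 per call, 500 calls allowed per day
print(days_to_collect(1_000_000, 100, 500))  # 10,000 calls / 500 per day -> 20
```

If the estimate runs to months, that is a signal to look at a paid tier, a bulk export, or an existing public dataset instead.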


**3. Unstable APIs**
- Endpoints change or disappear
- Response format changes
- Service downtime
- Authentication issues
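
One defensive measure against changing response formats is to validate each record before it enters the dataset, so that a silent schema change surfaces as an explicit error rather than as corrupted training data. The sketch below assumes a hypothetical schema (`REQUIRED_FIELDS`) and a hypothetical helper name (`validate_record`).

```python
# Hypothetical schema: field name -> accepted Python type(s)
REQUIRED_FIELDS = {"id": int, "title": str, "price": (int, float)}


def validate_record(record, required=REQUIRED_FIELDS):
    """Return a list of problems found in one API record (empty list means OK)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field, expected in required.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems


print(validate_record({"id": 1, "title": "Widget", "price": 9.99}))
print(validate_record({"id": "1", "title": "Widget"}))
```

Records with a non-empty problem list can be logged and skipped, which keeps the collection job running while flagging that the API may have changed.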

#### Tools for Data Collection:


In [None]:
# Option 1: Beautiful Soup (simple HTML parsing)
from bs4 import BeautifulSoup
import requests

def scrape_simple(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.find_all('div', class_='item')

# Option 2: Selenium (JavaScript rendering)
from selenium import webdriver

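def scrape_dynamic(url):
    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(10)  # Wait up to 10s when locating elements
        driver.get(url)
        elements = driver.find_elements("class name", "item")
        # Extract the text now: WebElements are unusable once the browser closes
        return [element.text for element in elements]
    finally:
        driver.quit()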

# Option 3: Scrapy (industrial-grade scraping)
import scrapy
from scrapy.crawler import CrawlerProcess

class DataSpider(scrapy.Spider):
    name = "data_spider"
    start_urls = ['https://example.com']
    
    def parse(self, response):
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'price': item.css('span.price::text').get(),
            }

# Option 4: Requests + Pandas (for APIs)
import requests
import pandas as pd

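def fetch_from_api(api_url):
    response = requests.get(api_url, timeout=10)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error body
    data = response.json()
    df = pd.DataFrame(data)  # Assumes the API returns a list of flat records
    return df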

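# Option 5: Official Data Packages
# Importing kaggle authenticates immediately, so API credentials
# (e.g. ~/.kaggle/kaggle.json) must be configured first
import kaggle

# Download from Kaggle; datasets are identified as 'owner/dataset-name'
kaggle.api.dataset_download_files('owner/dataset-name')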

# Use public datasets: OpenML, UCI, Google Datasets, GitHub


### Real-World Data Collection Scenarios:


In [None]:
# Scenario 1: E-commerce Product Data
# Challenge: Amazon heavily protects against scraping
# Solution: Use official APIs or buy pre-scraped datasets
import requests

def get_ecommerce_data_ethical():
    # Use official API instead of scraping
    api_key = "your-api-key"
    response = requests.get(
        "https://api.example.com/products",
        headers={'API-Key': api_key}
    )
    return response.json()

# Scenario 2: Social Media Data
# Challenge: APIs have strict rate limits
# Solution: Be selective about what/when to collect
import tweepy
import time

def collect_tweets_responsibly():
    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
    
    tweets = []
    for query in ["python", "machine learning", "data science"]:
        response = client.search_recent_tweets(
            query=query,
            max_results=100  # Respect limits
        )
        tweets.extend(response.data or [])  # data is None when nothing matches
        time.sleep(2)  # Respectful delay
    
    return tweets

# Scenario 3: Scientific Data
# Challenge: Limited availability, high cost
# Solution: Use open-source datasets, replicate studies
from sklearn.datasets import load_breast_cancer
import tensorflow_datasets as tfds

# Use established datasets
data = load_breast_cancer()
# or
dataset = tfds.load('mnist')


---
