![Data collection banner](./images/1_data_collection_banner.png)

# 1. Data Collection

Gather labeled data consisting of input features (independent variables) and their corresponding target labels or output variables (dependent variables). This labeled data is essential for training supervised learning models.
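
As a minimal sketch of what "labeled data" means in practice, the snippet below separates a few made-up churn records into a feature matrix `X` and a target vector `y`. The column names (`tenure_months`, `monthly_charges`, `churned`) and the helper `split_features_target` are hypothetical and chosen only for illustration.

```python
# Made-up churn records: each row pairs input features with a target label.
records = [
    {"tenure_months": 2, "monthly_charges": 70.5, "churned": 1},
    {"tenure_months": 48, "monthly_charges": 29.9, "churned": 0},
    {"tenure_months": 12, "monthly_charges": 55.0, "churned": 0},
]

FEATURES = ["tenure_months", "monthly_charges"]  # independent variables
TARGET = "churned"                               # dependent variable


def split_features_target(rows, feature_names, target_name):
    """Return (X, y): one feature list per row and the matching labels."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target_name] for row in rows]
    return X, y


X, y = split_features_target(records, FEATURES, TARGET)
```

Each entry of `X` lines up with the label at the same index in `y`, which is the shape most supervised learning libraries expect.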

## 1.1. Data Sources

![Data sources](./images/1_data_collection_data_sources.png)

### 1.1.1. Internal Data Sources

- **Databases and Data Warehouses**: Many organizations have existing databases or data warehouses that store structured data from various business operations, transactions, or customer interactions. This data can be a valuable source for supervised learning problems, provided it contains relevant features and target variables.

  Examples:
    - Telco Customer Churn Dataset - Telecom companies use their customer database with features like tenure, monthly charges, and service types to predict which customers are likely to churn
    - Retail Transaction Database - Point-of-sale systems generate transaction logs that can predict demand, detect fraud, or build recommendation systems

- **Log Files and Sensor Data**: For certain applications, log files from systems or sensor data from IoT devices can be used as data sources. These may require extensive preprocessing and feature engineering to extract meaningful information.

  Examples:
    - Manufacturing Sensor Data - IoT sensors on production lines capture temperature, pressure, and vibration data to predict equipment failure (predictive maintenance)
    - Server Log Files - Web server logs containing response times and error codes help predict system failures or detect security intrusions
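
The two internal sources above could be tapped roughly as follows. This is a sketch under stated assumptions: the `customers` table, its columns, and the log line format are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse. The first function pulls feature and target columns with a SQL query; the second turns raw log lines into a few numeric features.

```python
import re
import sqlite3
from collections import Counter


def load_churn_rows(conn):
    """Read feature columns and the target column from a customers table."""
    cursor = conn.execute(
        "SELECT tenure_months, monthly_charges, churned "
        "FROM customers ORDER BY customer_id"
    )
    return cursor.fetchall()


# Assumed log format: <ip> "<method> <path>" <status> <response_ms>
LOG_PATTERN = re.compile(r'^(\S+) "(\w+) (\S+)" (\d{3}) (\d+)$')


def summarize_log(lines):
    """Aggregate raw log lines into simple numeric features."""
    statuses = Counter()
    total_ms = 0
    parsed = 0
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip malformed lines rather than guessing their content
        statuses[match.group(4)] += 1
        total_ms += int(match.group(5))
        parsed += 1
    errors = sum(n for code, n in statuses.items() if code.startswith("5"))
    return {
        "requests": parsed,
        "error_rate": errors / parsed if parsed else 0.0,
        "mean_response_ms": total_ms / parsed if parsed else 0.0,
    }


# Stand-in for an existing internal database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, "
    "tenure_months INTEGER, monthly_charges REAL, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, 2, 70.5, 1), (2, 48, 29.9, 0)],
)
rows = load_churn_rows(conn)
conn.close()

log_lines = [
    '10.0.0.1 "GET /index.html" 200 120',
    '10.0.0.2 "POST /login" 500 480',
    "garbage line",
    '10.0.0.3 "GET /health" 200 30',
]
log_features = summarize_log(log_lines)
```

Real log formats vary widely, so the regular expression would need to be adapted to the system that produced the logs.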

### 1.1.2. External Data Sources

- **Web Scraping**: Data can be extracted from websites, online forums, social media platforms, or other publicly available online sources through web scraping techniques. However, it's essential to ensure compliance with website terms of service and data privacy regulations.

  Examples:
    - Real Estate Listings - Zillow/Redfin property data (prices, features, location) scraped for building property valuation models
    - Job Postings - Indeed/LinkedIn job descriptions and salaries scraped for salary prediction or skill demand analysis

- **Crowdsourcing and Surveys**: Crowdsourcing platforms or online surveys can be used to collect labeled data from human participants. This approach is particularly useful when manual labeling or annotations are required, such as in computer vision or natural language processing tasks.

  Examples:
    - ImageNet - More than 14 million images labeled by crowd workers into 20,000+ categories; it became a standard benchmark that drove major advances in computer vision
    - COCO Dataset - 200,000+ images with object detection labels and segmentation masks crowdsourced via Amazon MTurk
    - Stanford Question Answering Dataset (SQuAD) - 100,000+ question-answer pairs on Wikipedia articles created by crowdworkers

- **Open Data Repositories**: Various organizations and communities maintain open data repositories containing datasets across different domains, such as healthcare, finance, or environmental sciences. These datasets can be used for supervised learning tasks, provided they align with the problem at hand.

  Examples:
    - UCI Machine Learning Repository - 600+ datasets including the famous Wine Quality, Breast Cancer Wisconsin, and Adult Income datasets
    - Kaggle Datasets - Titanic survival prediction (classification), House Prices (regression), Credit Card Fraud Detection (imbalanced classification)
    - Government Data - Census.gov demographic data, NOAA weather data, FDA adverse event reports

- **Data Marketplaces**: There are commercial data marketplaces where organizations or individuals can purchase datasets for specific use cases or domains. These datasets are often curated, cleaned, and labeled, but they can be expensive.

  Examples:
    - Nielsen Data - Consumer behavior and retail analytics for demand forecasting
    - Experian/Equifax - Credit data for loan default prediction and credit scoring models
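
For the web scraping case, one small and automatable part of compliance is honoring a site's `robots.txt`. The sketch below uses Python's standard `urllib.robotparser` to filter candidate URLs before any request is made. The robots rules, the `example.com` URLs, and the user agent name `my-research-bot` are made up for illustration, and passing this check does not replace reading the site's terms of service or the applicable privacy rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /listings/
"""


def build_parser(robots_text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots rules permit for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/listings/123",
    "https://example.com/private/admin",
]
fetchable = allowed_urls(parser, "my-research-bot", candidates)
```

Here only the `/listings/` URL survives the filter, so the scraper would never request the disallowed path.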

-----

## 1.2. Data Collection Considerations

- **Data Quality**: Ensure that the collected data is accurate, complete, and free from errors or inconsistencies. Data quality issues can significantly impact the performance of supervised learning models.

  Example:
    - Medical Diagnosis Dataset - Missing lab values or incorrect patient demographics can lead to dangerously wrong diagnostic predictions

- **Data Relevance**: Carefully evaluate the relevance of the collected data to the problem you are trying to solve. Irrelevant or redundant features can introduce noise and negatively impact model performance.

  Example:
    - Credit Default Prediction - A feature such as shoe size is irrelevant and only adds noise, while credit history and income are highly relevant

- **Data Representativeness**: The collected data should be representative of the real-world scenario or population you want to model. Biased or skewed data can lead to models that perform poorly on unseen data.

  Examples:
    - Facial Recognition Dataset - A model trained mostly on one demographic group tends to perform poorly on underrepresented groups
    - Stock Market Data - A model trained only on bull-market data is likely to fail during bear markets

- **Data Privacy and Ethics**: When collecting data, especially from individuals, it's crucial to consider data privacy regulations, such as GDPR or CCPA, and ensure that the data collection process adheres to ethical principles and guidelines.

  Examples:
    - Healthcare Records - HIPAA compliance is required for US patient data; datasets like MIMIC-III provide de-identified patient records
    - Social Media Data - Twitter API usage must comply with user privacy settings and platform terms

- **Data Labeling**: For supervised learning, the collected data must be labeled with the corresponding target variables or output classes. This labeling process can be time-consuming and may require domain expertise or crowdsourcing efforts.

  Examples:
    - Chest X-Ray Classification - Requires radiologists to label images for conditions like pneumonia (CheXpert dataset)
    - Sentiment Analysis - Amazon Product Reviews come with 1-5 star ratings, which can serve as ready-made (if noisy) sentiment labels

- **Data Versioning and Documentation**: Maintain clear documentation and versioning of the collected data, including metadata, data dictionaries, and any preprocessing or cleaning steps applied. This ensures reproducibility and transparency in the modeling process.

  Example:
    - Fashion-MNIST - Ships with detailed documentation and a fixed split of 60,000 training and 10,000 test images (70,000 28x28 grayscale images in 10 categories)
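
Several of the considerations above can be checked mechanically before any modeling starts. The following is a minimal sketch, assuming the dataset fits in memory as a list of dictionaries: it counts missing values per column (data quality), reports the share of each label (representativeness), and computes a SHA-256 fingerprint of the rows that can be recorded next to the documentation as a lightweight version identifier. The rows and all function names are made up for illustration.

```python
import hashlib
import json
from collections import Counter


def missing_counts(rows, columns):
    """Count rows where a column is absent or None."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def label_distribution(rows, target):
    """Share of each label among the rows that have a label."""
    labels = [row[target] for row in rows if row.get(target) is not None]
    total = len(labels)
    if total == 0:
        return {}
    return {label: count / total for label, count in Counter(labels).items()}


def dataset_fingerprint(rows):
    """SHA-256 of a canonical JSON dump; changes whenever the data changes."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Made-up records, not real patient data.
rows = [
    {"age": 54, "lab_value": 1.2, "diagnosis": "positive"},
    {"age": None, "lab_value": 0.9, "diagnosis": "negative"},
    {"age": 37, "lab_value": None, "diagnosis": "negative"},
    {"age": 61, "lab_value": 1.1, "diagnosis": "negative"},
]

report = {
    "missing": missing_counts(rows, ["age", "lab_value", "diagnosis"]),
    "label_share": label_distribution(rows, "diagnosis"),
    "sha256": dataset_fingerprint(rows),
}
```

A report like this does not prove the data is good, but a sudden jump in missing values or a shift in label shares between two fingerprints is a cheap early warning.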

By weighing these factors and choosing appropriate data sources, you improve the chances that the collected data is suitable for training supervised learning models that make accurate and reliable predictions.

![Data collection meme](./images/1_data_collection_data_meme.jpg)