# Module 05: Data Collection Methods

Welcome to Module 05! Now that you know how to design experiments, let's learn how to **collect high-quality data** to support your research.

## What You'll Learn

- Difference between primary and secondary data
- Data collection techniques for different research types
- Ensuring data quality (the 6 dimensions)
- Data validation strategies
- Documentation during collection
- Common pitfalls and how to avoid them

## Prerequisites

- Completed Modules 00-04
- Understanding of your research design

## Time Required

**35 minutes**

---

In [None]:
# ========================================
# Setup
# ========================================

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Create output directory
output_dir = "outputs/notebook_05"
os.makedirs(output_dir, exist_ok=True)

print("✅ Setup complete!")
print(f"Output directory: {output_dir}")

## Part 1: Primary vs Secondary Data

Understanding the source of your data is crucial for research design.

### Primary Data

**Definition**: Data YOU collect directly for YOUR research

**Characteristics**:
- Original collection
- Tailored to your research question
- You control quality
- Time-consuming and expensive

**Examples in Data Science**:
- Running experiments with ML models
- Conducting user surveys
- Collecting sensor data
- Recording user interactions
- Creating labeled datasets
- Running A/B tests

**Pros**:
- ✅ Exactly what you need
- ✅ You know the quality
- ✅ Full control over collection
- ✅ Can collect missing variables

**Cons**:
- ❌ Time-consuming
- ❌ Expensive
- ❌ Requires expertise
- ❌ May have sampling limitations

### Secondary Data

**Definition**: Data collected by OTHERS for different purposes

**Characteristics**:
- Pre-existing
- Not designed for your specific question
- Quality varies
- Usually faster and cheaper

**Examples in Data Science**:
- Public datasets (Kaggle, UCI ML Repository)
- Government data (census, economic indicators)
- Published datasets from papers
- Company databases
- Web-scraped data
- API data (Twitter, financial markets)

**Pros**:
- ✅ Fast and cheap
- ✅ Large sample sizes often available
- ✅ Historical data
- ✅ Professional collection

**Cons**:
- ❌ May not match your needs exactly
- ❌ Quality unknown or variable
- ❌ Missing variables you need
- ❌ Potential bias from original purpose

### Which Should You Use?

**Use Primary Data When**:
- No existing data available
- Need specific variables
- Quality control is critical
- Testing new methods/algorithms

**Use Secondary Data When**:
- Good datasets exist
- Limited time/budget
- Need historical data
- Replicating previous studies

**Best Practice**: Often use BOTH!
- Secondary data for exploration
- Primary data for confirmation
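
The rules of thumb above reduce to a few yes/no questions. The sketch below is one possible way to encode them; the function name `choose_data_source` and its three boolean parameters are hypothetical and exist only for illustration.

```python
def choose_data_source(suitable_data_exists, has_needed_variables, quality_acceptable):
    """Suggest 'secondary' only when existing data passes every check."""
    if suitable_data_exists and has_needed_variables and quality_acceptable:
        return "secondary"
    # Any failed check points toward collecting the data yourself
    return "primary"


print(choose_data_source(True, True, True))   # existing data is good enough
print(choose_data_source(True, False, True))  # a needed variable is missing
```

In practice the answer is rarely this binary, which is why combining both sources is often the better plan.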

In [None]:
# ========================================
# Decision Tree: Primary vs Secondary Data
# ========================================

decision_tree = """
DATA SOURCE DECISION TREE
=========================

START: Do you need data for your research?
           ↓
    Does suitable data already exist?
           ↓
    ┌──────┴──────┐
   YES            NO
    ↓              ↓
Is it accessible?  Must collect
    ↓              PRIMARY DATA
┌───┴───┐             ↓
YES    NO          Methods:
 ↓      ↓          - Experiments
Is    Must         - Surveys
quality collect    - Observations
good? PRIMARY      - Sensors
 ↓      DATA
┌┴┐
YES  NO
 ↓    ↓
Use  Collect
SECONDARY  PRIMARY
DATA       DATA

DECISION FACTORS:
─────────────────
□ Research question specificity
□ Time constraints
□ Budget
□ Required quality
□ Sample size needed
□ Variable requirements
"""

print(decision_tree)

# Save decision tree
with open(f"{output_dir}/data_source_decision.txt", "w", encoding="utf-8") as f:
    f.write(decision_tree)

print(f"\n✅ Decision tree saved to: {output_dir}/data_source_decision.txt")

## Part 2: Data Collection Techniques

Different research questions require different collection methods.

### For Quantitative Research

#### 1. Experiments

**What**: Manipulate the independent variable (IV) and measure the dependent variable (DV) under controlled conditions

**Example in Data Science**:
```python
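from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Test different algorithms
# (assumes X_train, X_test, y_train, y_test are already defined)
results = []
for algorithm in [RandomForestClassifier, XGBClassifier, MLPClassifier]:
    # Train on same data, with a fixed seed for reproducibility
    model = algorithm(random_state=42).fit(X_train, y_train)
    # Measure performance
    accuracy = model.score(X_test, y_test)
    # Record results
    results.append({'algo': algorithm.__name__, 'acc': accuracy})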
```

**Best for**: Testing causal relationships

#### 2. Surveys

**What**: Ask structured questions to many people

**Question Types**:
- **Closed**: Multiple choice, rating scales
- **Open**: Free text responses

**Example Questions**:
- "How satisfied are you with the model predictions?" (1-5 scale)
- "How often do you use this feature?" (Daily/Weekly/Monthly/Never)

**Best for**: Gathering opinions, behaviors, demographics at scale

**Tips**:
- Keep it short (< 10 minutes)
- Clear, unambiguous questions
- Avoid leading questions
- Pilot test first
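
Closed questions produce answers that are easy to tabulate. The sketch below assumes pandas and a handful of made-up responses to the usage-frequency question above; the `responses` values are invented for illustration.

```python
import pandas as pd

# Hypothetical answers to "How often do you use this feature?"
responses = pd.Series(["Daily", "Weekly", "Daily", "Never", "Daily", "Weekly"])

# List every answer option so unchosen options still appear with a count of 0
options = ["Daily", "Weekly", "Monthly", "Never"]
counts = responses.value_counts().reindex(options, fill_value=0)

print(counts)
```

Reindexing against the full option list keeps "Monthly" in the table even though nobody chose it, which matters when you compare survey waves.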

#### 3. Observational Studies

**What**: Measure variables without manipulation

**Example**: Analyze existing user behavior logs
```python
# Observe natural behavior
user_logs = pd.read_csv('user_activity.csv')
# Measure variables
session_length = user_logs.groupby('user_id')['duration'].mean()
churn_rate = user_logs.groupby('cohort')['churned'].mean()
```

**Best for**: When experiments are impractical or unethical

#### 4. Sensor/Automated Collection

**What**: Continuous automated measurement

**Examples**:
- Server logs
- IoT sensor data
- Click tracking
- Model performance monitoring

**Best for**: Large-scale, continuous data

### For Qualitative Research

#### 1. Interviews

**What**: One-on-one conversations

**Types**:
- **Structured**: Fixed questions
- **Semi-structured**: Flexible guide
- **Unstructured**: Open conversation

**Example**: Interview data scientists about algorithm selection process

**Best for**: Deep understanding, exploring new areas

#### 2. Focus Groups

**What**: Group discussions (6-10 people)

**Example**: Discuss user experience with ML system

**Best for**: Generating ideas, understanding group dynamics

#### 3. Case Studies

**What**: In-depth study of specific instances

**Example**: Detailed analysis of model deployment at Company X

**Best for**: Understanding complex real-world situations

## Part 3: The 6 Dimensions of Data Quality

High-quality data is essential for valid research. Evaluate your data on these six dimensions:

### 1. Accuracy

**Definition**: Data correctly represents reality

**Questions to ask**:
- Are measurements correct?
- Any data entry errors?
- Are labels accurate?

**How to check**:
- Compare with ground truth (when available)
- Look for impossible values
- Cross-validate with other sources

**Example Issues**:
- Age = -5 (impossible)
- Price = $999999999 (likely error)
- Mislabeled images in training data
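
When a small hand-verified sample is available, comparing it with the collected values gives a rough accuracy estimate. This is a minimal sketch with invented labels; `collected` and `verified` are hypothetical names, and a real audit would need a properly drawn sample.

```python
import pandas as pd

# Hypothetical audit: collected labels vs. a hand-verified ground truth
collected = pd.Series(["cat", "dog", "dog", "cat", "bird"])
verified = pd.Series(["cat", "dog", "cat", "cat", "bird"])

# True where the collected label matches the verified one
agreement = collected == verified

print(f"Label agreement: {agreement.mean():.0%}")
print("Rows to review:", list(agreement[~agreement].index))
```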

### 2. Completeness

**Definition**: All required data is present

**Questions to ask**:
- How many missing values?
- Are they missing at random?
- Can they be imputed?

**How to check**:
```python
# Check missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
```

**Thresholds** (rules of thumb; the right cutoff depends on your analysis):
- < 5% missing: Usually okay
- 5-20% missing: Consider carefully
- > 20% missing: May need to drop or collect more
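
The thresholds can be applied column by column. The sketch below is one way to do that; `classify_missing` and the example DataFrame are hypothetical, and the cutoffs simply mirror the list above.

```python
import numpy as np
import pandas as pd


def classify_missing(pct):
    """Map a missing-value percentage to the categories listed above."""
    if pct < 5:
        return "usually okay"
    if pct <= 20:
        return "consider carefully"
    return "drop or collect more"


# Hypothetical data with 0%, 10%, and 30% missing values
example_df = pd.DataFrame(
    {
        "a": list(range(10)),
        "b": [np.nan] + list(range(9)),
        "c": [np.nan] * 3 + list(range(7)),
    }
)

missing_pct = example_df.isnull().mean() * 100
status = missing_pct.apply(classify_missing)
print(status)
```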

### 3. Consistency

**Definition**: Data is uniform and doesn't contradict itself

**Questions to ask**:
- Same format throughout?
- Consistent units?
- No contradictions?

**Example Issues**:
- Dates: "2024-01-15" vs "01/15/2024" vs "15-Jan-24"
- Units: Some heights in cm, others in inches
- Contradictions: Age = 25, Birth_Year = 1980
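
Two of these issues can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the `FORMATS` list only covers the three date styles shown above, and `to_iso`, `age_matches_birth_year`, and `reference_year` are hypothetical names.

```python
from datetime import datetime

# Date formats known to occur in the data (extend as needed)
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y"]


def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def age_matches_birth_year(age, birth_year, reference_year):
    """Allow a one-year difference because the birthday may not have passed yet."""
    return abs((reference_year - birth_year) - age) <= 1


print([to_iso(d) for d in ["2024-01-15", "01/15/2024", "15-Jan-24"]])
print(age_matches_birth_year(age=25, birth_year=1980, reference_year=2024))
```

Values that come back as `None` are worth logging rather than guessing at, since ambiguous strings such as "01/02/2024" cannot be resolved without knowing the source's convention.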

### 4. Timeliness

**Definition**: Data is current and up-to-date

**Questions to ask**:
- When was data collected?
- Is it still relevant?
- Has the phenomenon changed?

**Example**:
- Using 2015 social media data to study 2024 trends ❌
- Stock prediction with 10-year-old data ❌
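
A simple staleness check compares the newest record with the date of the study. In this sketch the dates are invented and `max_age_days` is an assumed tolerance; how old is too old depends entirely on the phenomenon being studied.

```python
import pandas as pd

# Hypothetical collection timestamps and study date
collected = pd.to_datetime(pd.Series(["2015-03-01", "2015-06-15", "2015-11-30"]))
study_date = pd.Timestamp("2024-01-15")
max_age_days = 365  # assumption: data older than one year counts as stale

# Age of the most recent record, in days
age_days = (study_date - collected.max()).days
is_stale = age_days > max_age_days

print(f"Newest record is {age_days} days old; stale: {is_stale}")
```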

### 5. Validity

**Definition**: Data measures what it's supposed to measure

**Questions to ask**:
- Does this measure my construct?
- Are values in valid ranges?
- Correct data types?

**Example Checks**:
```python
# Age should be 0-120
assert df['age'].between(0, 120).all()

# Percentage should be 0-100
assert df['percentage'].between(0, 100).all()

# Category should be in allowed values
valid_categories = ['A', 'B', 'C']
assert df['category'].isin(valid_categories).all()
```

### 6. Uniqueness

**Definition**: No unwanted duplicates

**Questions to ask**:
- Any duplicate records?
- Are duplicates intentional or errors?

**How to check**:
```python
# Find duplicates
duplicates = df.duplicated().sum()
print(f"Found {duplicates} duplicate rows")

# View duplicates
df[df.duplicated(keep=False)]
```
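
Whole-row comparison misses records that share a key but differ elsewhere. The sketch below uses an invented `users` table to show a key-based check and one possible resolution (keeping the most recent record); whether that resolution is appropriate depends on your data.

```python
import pandas as pd

# Hypothetical table: user_id 2 appears twice with different details
users = pd.DataFrame(
    {
        "user_id": [1, 2, 2, 3],
        "email": ["a@x.org", "b@x.org", "b2@x.org", "c@x.org"],
        "updated": pd.to_datetime(
            ["2024-01-10", "2024-01-11", "2024-01-14", "2024-01-12"]
        ),
    }
)

# No row is an exact copy of another, so this reports 0
print("Exact duplicates:", users.duplicated().sum())

# Checking only the key column reveals the repeated user
key_dupes = users[users.duplicated(subset=["user_id"], keep=False)]
print(key_dupes)

# One possible resolution: keep the most recently updated record per user
deduped = users.sort_values("updated").drop_duplicates(subset=["user_id"], keep="last")
```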

In [None]:
# ========================================
# Data Quality Assessment Tool
# ========================================


def assess_data_quality(df, required_columns=None, valid_ranges=None):
    """
    Comprehensive data quality assessment.

    Args:
        df (pd.DataFrame): Data to assess
        required_columns (list): Columns that must exist
        valid_ranges (dict): Valid ranges for numeric columns

    Returns:
        dict: Quality assessment report
    """
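    report = {"total_rows": len(df), "total_columns": len(df.columns), "dimensions": {}}

    # Record any required columns that are absent from the data
    if required_columns:
        report["missing_required_columns"] = [
            col for col in required_columns if col not in df.columns
        ]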

    # 1. COMPLETENESS
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    report["dimensions"]["completeness"] = {
        "missing_values": missing.to_dict(),
        "missing_pct": missing_pct.to_dict(),
        "worst_column": missing_pct.idxmax(),
        "worst_pct": missing_pct.max(),
    }

    # 2. UNIQUENESS
    duplicates = int(df.duplicated().sum())  # plain int so the report is JSON-serializable
    report["dimensions"]["uniqueness"] = {
        "duplicate_rows": duplicates,
        "duplicate_pct": (duplicates / len(df)) * 100,
    }

    # 3. VALIDITY (if ranges provided)
    validity_issues = []
    if valid_ranges:
        for col, (min_val, max_val) in valid_ranges.items():
            if col in df.columns:
                # Missing values are counted under completeness, not validity
                invalid = df[col].notna() & ~df[col].between(min_val, max_val)
                invalid_count = int(invalid.sum())  # plain int for JSON output
                if invalid_count > 0:
                    validity_issues.append(
                        {
                            "column": col,
                            "invalid_count": invalid_count,
                            "expected_range": f"{min_val}-{max_val}",
                        }
                    )
    report["dimensions"]["validity"] = validity_issues

    # OVERALL QUALITY SCORE
    scores = []

    # Completeness score: lose 2 points per percentage point of average missingness
    avg_missing = missing_pct.mean()
    completeness_score = max(0, 100 - avg_missing * 2)
    scores.append(completeness_score)

    # Uniqueness score
    uniqueness_score = 100 - (duplicates / len(df)) * 100
    scores.append(uniqueness_score)

    report["overall_quality_score"] = np.mean(scores)

    return report


# Example: Generate sample data with quality issues
np.random.seed(42)
sample_data = pd.DataFrame(
    {
        "id": range(1, 101),
        "age": np.random.randint(18, 80, 100).astype(float),  # float so NaN can be stored
        "income": np.random.randint(20000, 100000, 100),
        "score": np.random.uniform(0, 100, 100),
    }
)

# Introduce quality issues
sample_data.loc[5, "age"] = np.nan  # Missing value
sample_data.loc[10, "age"] = 150  # Invalid value
sample_data = pd.concat([sample_data, sample_data.iloc[[0]]])  # Duplicate

# Assess quality
valid_ranges = {"age": (0, 120), "income": (0, 1000000), "score": (0, 100)}

quality_report = assess_data_quality(sample_data, valid_ranges=valid_ranges)

print("DATA QUALITY ASSESSMENT REPORT")
print("=" * 60)
print(
    f"\nDataset Size: {quality_report['total_rows']} rows × {quality_report['total_columns']} columns"
)
print(f"\nOverall Quality Score: {quality_report['overall_quality_score']:.1f}/100")
print("\n📊 QUALITY DIMENSIONS:")
print("\n1. COMPLETENESS:")
print(f"   Worst column: {quality_report['dimensions']['completeness']['worst_column']}")
print(f"   Missing: {quality_report['dimensions']['completeness']['worst_pct']:.1f}%")
print("\n2. UNIQUENESS:")
print(f"   Duplicates: {quality_report['dimensions']['uniqueness']['duplicate_rows']} rows")
print(f"   ({quality_report['dimensions']['uniqueness']['duplicate_pct']:.1f}%)")
if quality_report["dimensions"]["validity"]:
    print("\n3. VALIDITY ISSUES:")
    for issue in quality_report["dimensions"]["validity"]:
        print(f"   {issue['column']}: {issue['invalid_count']} invalid values")
        print(f"   (Expected range: {issue['expected_range']})")

# Save report
import json

with open(f"{output_dir}/quality_report.json", "w") as f:
    json.dump(quality_report, f, indent=2)

print(f"\n✅ Quality report saved to: {output_dir}/quality_report.json")

## Part 4: Data Validation Strategies

Don't just collect data - **validate** it!

### Validation Checks to Implement

#### 1. Range Checks
```python
# Ensure values within expected range
def validate_range(df, column, min_val, max_val):
    invalid = ~df[column].between(min_val, max_val)
    if invalid.any():
        print(f"⚠️ {invalid.sum()} invalid values in {column}")
        print(f"Expected: {min_val} to {max_val}")
        print(f"Found: {df.loc[invalid, column].values}")
    return ~invalid
```

#### 2. Type Checks
```python
# Ensure correct data types
def validate_types(df, type_map):
    for column, expected_type in type_map.items():
        if df[column].dtype != expected_type:
            print(f"⚠️ {column}: Expected {expected_type}, got {df[column].dtype}")
```

#### 3. Format Checks
```python
# Check string formats (emails, phone numbers, etc.)
import re

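def validate_email(email):
    # Non-string values (e.g., NaN for a missing email) are treated as invalid
    if not isinstance(email, str):
        return False
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))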

df['valid_email'] = df['email'].apply(validate_email)
```

#### 4. Consistency Checks
```python
# Check logical consistency
# Example: Start date should be before end date
invalid = df['start_date'] > df['end_date']
if invalid.any():
    print(f"⚠️ {invalid.sum()} records with start_date after end_date")
```

#### 5. Referential Integrity
```python
# Check foreign keys exist
valid_ids = master_table['id'].unique()
invalid = ~df['foreign_id'].isin(valid_ids)
if invalid.any():
    print(f"⚠️ {invalid.sum()} records with invalid foreign keys")
```

### Automated Validation Pipeline

Create a validation pipeline that runs automatically:

```python
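class DataValidator:
    # required_columns, type_map, and ranges are illustrative parameters
    def __init__(self, df, required_columns=(), type_map=None, ranges=None):
        self.df = df
        self.required_columns = required_columns
        self.type_map = type_map or {}  # column -> expected dtype
        self.ranges = ranges or {}  # column -> (min_val, max_val)
        self.errors = []

    def add_check(self, name, condition, message):
        # Record an error whenever a condition fails
        if not condition:
            self.errors.append(f"{name}: {message}")

    def check_required_columns(self):
        missing = [c for c in self.required_columns if c not in self.df.columns]
        self.add_check("required_columns", not missing, f"missing {missing}")

    def check_data_types(self):
        for col, expected in self.type_map.items():
            if col in self.df.columns:
                actual = self.df[col].dtype
                self.add_check("data_types", actual == expected, f"{col} is {actual}, expected {expected}")

    def check_ranges(self):
        for col, (min_val, max_val) in self.ranges.items():
            if col in self.df.columns:
                out_of_range = ~self.df[col].dropna().between(min_val, max_val)
                self.add_check("ranges", not out_of_range.any(), f"{col} outside {min_val}-{max_val}")

    def check_duplicates(self):
        n_dup = int(self.df.duplicated().sum())
        self.add_check("duplicates", n_dup == 0, f"{n_dup} duplicate rows")

    def validate(self):
        # Run all checks, starting from a clean error list
        self.errors = []
        self.check_required_columns()
        self.check_data_types()
        self.check_ranges()
        self.check_duplicates()

        if len(self.errors) == 0:
            print("✅ All validation checks passed!")
            return True
        else:
            print(f"❌ Found {len(self.errors)} validation errors:")
            for error in self.errors:
                print(f"  - {error}")
            return False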

## Part 5: Documentation During Collection

**Document EVERYTHING as you collect data!**

### What to Document

#### 1. Data Dictionary
```
Column Name  | Data Type | Description          | Valid Range | Missing Allowed?
-------------|-----------|----------------------|-------------|------------------
user_id      | int       | Unique user ID       | > 0         | No
age          | int       | User age in years    | 0-120       | Yes
signup_date  | date      | Account creation     | Any         | No
```
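
A first draft of the dictionary can be generated from the data itself and then completed by hand. This is a minimal sketch assuming pandas; `draft_data_dictionary` is a hypothetical helper, and the observed minimum and maximum are only a starting point for deciding the valid range.

```python
import pandas as pd


def draft_data_dictionary(df):
    """Build a skeleton data dictionary with one row per column."""
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = pd.api.types.is_numeric_dtype(series)
        rows.append(
            {
                "column": col,
                "dtype": str(series.dtype),
                "observed_min": series.min() if numeric else None,
                "observed_max": series.max() if numeric else None,
                "has_missing": bool(series.isnull().any()),
                "description": "",  # to be written by the researcher
            }
        )
    return pd.DataFrame(rows)


# Hypothetical data for demonstration
example = pd.DataFrame({"user_id": [1, 2, 3], "age": [34.0, None, 51.0]})
dictionary = draft_data_dictionary(example)
print(dictionary)
```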

#### 2. Collection Metadata
- **When**: Date/time of collection
- **Who**: Researcher name
- **How**: Method used
- **Where**: Source/location
- **Why**: Purpose
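
These five fields fit naturally into a small JSON record stored next to the dataset. The sketch below is one possible layout; `build_collection_metadata` is a hypothetical helper and all field values are placeholders.

```python
import json
from datetime import datetime, timezone


def build_collection_metadata(who, how, where, why):
    """Bundle the when/who/how/where/why of a collection run into a dict."""
    return {
        "when": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "who": who,
        "how": how,
        "where": where,
        "why": why,
    }


# All values below are placeholders
metadata = build_collection_metadata(
    who="Researcher name",
    how="REST API export",
    where="Example source",
    why="Baseline dataset for the study",
)
print(json.dumps(metadata, indent=2))
```

Recording the timestamp in UTC avoids ambiguity when collaborators work in different time zones.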

#### 3. Processing Log
```
2024-01-15 10:00 - Collected raw data from API
2024-01-15 10:15 - Removed 5 duplicate records
2024-01-15 10:20 - Imputed 12 missing age values with median
2024-01-15 10:25 - Validated all records
```
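
Log entries in this format are easy to produce programmatically, which makes it more likely they actually get written. This is a minimal sketch; `log_step` is a hypothetical helper, and the optional `now` argument exists only so the example output is reproducible.

```python
from datetime import datetime


def log_step(log, message, now=None):
    """Append a timestamped entry to the processing log and return it."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    entry = f"{stamp} - {message}"
    log.append(entry)
    return entry


processing_log = []
log_step(processing_log, "Collected raw data from API", now=datetime(2024, 1, 15, 10, 0))
log_step(processing_log, "Removed 5 duplicate records", now=datetime(2024, 1, 15, 10, 15))

print("\n".join(processing_log))
```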

#### 4. Quality Issues Log
```
Issue ID | Date       | Type        | Description              | Resolution
---------|------------|-------------|--------------------------|------------
001      | 2024-01-15 | Missing     | 10% of ages missing      | Imputed with median
002      | 2024-01-15 | Invalid     | 3 ages > 120             | Removed as outliers
003      | 2024-01-16 | Duplicate   | 5 duplicate user IDs     | Kept most recent
```

### Version Control for Data

Track changes to your datasets:
```
data/
├── v1.0_raw_collected_2024-01-15.csv
├── v1.1_cleaned_2024-01-16.csv
├── v1.2_validated_2024-01-17.csv
└── CHANGELOG.md
```

### CHANGELOG.md Example
```markdown
# Data Changelog

## v1.2 - 2024-01-17
- Validated all records
- Added data quality flags

## v1.1 - 2024-01-16
- Cleaned missing values
- Removed duplicates
- Fixed data types

## v1.0 - 2024-01-15
- Initial raw data collection
- 1000 records from API
```
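
File names alone cannot prove that a version was left unchanged. One common complement, not prescribed by this module, is to record a checksum for each file in the changelog. The sketch below uses only the standard library; `file_sha256` and the temporary file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for a dataset version
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "v1.0_raw.csv"
    raw_file.write_bytes(b"id,age\n1,34\n")
    checksum = file_sha256(raw_file)

print(f"v1.0_raw.csv sha256: {checksum}")
```

If the checksum of a file later differs from the recorded value, the file was modified after that changelog entry was written.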

In [None]:
# ========================================
# Create Data Collection Documentation Template
# ========================================

documentation_template = """
DATA COLLECTION DOCUMENTATION
==============================

PROJECT INFORMATION
-------------------
Project Name: _____________________________________________
Researcher(s): ____________________________________________
Date Started: _____________________________________________
Date Completed: ___________________________________________

DATA SOURCE
-----------
Type: □ Primary  □ Secondary
Source: ___________________________________________________
Collection Method: ________________________________________
Access Date: ______________________________________________

SAMPLE INFORMATION
------------------
Total Records: ____________________________________________
Time Period: ______________________________________________
Geographic Coverage: ______________________________________
Sampling Method: __________________________________________

DATA DICTIONARY
---------------
Variable Name | Type | Description | Valid Range | Missing OK?
--------------|------|-------------|-------------|------------
_______________|______|_____________|_____________|____________
_______________|______|_____________|_____________|____________
_______________|______|_____________|_____________|____________

QUALITY ASSESSMENT
------------------
Completeness: _______ % (target: > 95%)
Duplicates: _________ records (target: 0)
Invalid Values: _____ records (target: 0)
Overall Quality Score: _____ / 100

PROCESSING STEPS
----------------
1. ________________________________________________________
2. ________________________________________________________
3. ________________________________________________________

KNOWN ISSUES
------------
Issue 1: __________________________________________________
Resolution: _______________________________________________

Issue 2: __________________________________________________
Resolution: _______________________________________________

VALIDATION
----------
□ Range checks performed
□ Type checks performed
□ Duplicate checks performed
□ Consistency checks performed
□ Referential integrity verified

ETHICAL CONSIDERATIONS
----------------------
IRB Approval: □ Yes  □ No  □ Not Required
Informed Consent: □ Yes  □ No  □ Not Applicable
Privacy Protection: ________________________________________
Data Anonymization: ________________________________________

FILE INFORMATION
----------------
File Name: ________________________________________________
File Format: ______________________________________________
File Size: ________________________________________________
Location: _________________________________________________
Backup Location: __________________________________________

VERSION HISTORY
---------------
v1.0 - ____-__-__ : _______________________________________
v1.1 - ____-__-__ : _______________________________________
v1.2 - ____-__-__ : _______________________________________

NOTES
-----
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
"""

# Save template
with open(f"{output_dir}/data_collection_documentation.txt", "w", encoding="utf-8") as f:
    f.write(documentation_template)

print(documentation_template)
print(f"\n✅ Template saved to: {output_dir}/data_collection_documentation.txt")
print("\nUse this template for EVERY dataset you collect!")

---

## Summary

Congratulations on completing Module 05!

### Key Takeaways

✅ **Data Sources**: Primary (collect yourself) vs Secondary (use existing)

✅ **Collection Methods**: Experiments, surveys, observations, sensors, interviews

✅ **6 Quality Dimensions**: Accuracy, Completeness, Consistency, Timeliness, Validity, Uniqueness

✅ **Validation**: Implement automated checks for range, type, format, consistency

✅ **Documentation**: Record everything - collection process, issues, resolutions

### What You Can Do Now

- Choose appropriate data sources for your research
- Select collection methods based on research type
- Assess data quality across 6 dimensions
- Implement validation pipelines
- Document data collection thoroughly
- Track data versions

### Practice Exercise

**Exercise**: Plan Your Data Collection

Using your research design from Module 04:

1. Decide: Primary, secondary, or both?
2. Choose collection method(s)
3. Create a data dictionary
4. List validation checks needed
5. Fill out documentation template
6. Plan quality assurance steps

Save this - you'll use it in Module 09!

---

### Up Next

In **Module 06: Research Ethics**, you'll learn:
- Ethical principles in research
- Privacy and confidentiality
- Informed consent
- Bias and fairness
- Responsible data use

---

**Ready to continue?** Move on to `06_research_ethics.ipynb`!

**Need to review?** Go back to any section that needs more attention.