<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/exercises/Review_05_Data_Ethics_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WRIT 20833 Review 05: Data Ethics & Collection Methods

**Student Name:** ___________________  
**Date:** ___________________  

Explore ethical data collection and responsible research practices.

**Make a copy before you start:** File > Save a copy in Drive

## Exercise 1: Understanding Data Sources
Analyze different types of cultural data and their origins.

In [None]:
# Different types of cultural data sources
data_sources = {
    "social_media": {
        "type": "User-generated content",
        "examples": ["Twitter posts", "Instagram captions", "TikTok comments"],
        "ethical_concerns": ["Privacy", "Consent", "Context collapse"],
        "access_method": "APIs or scraping"
    },
    "historical_archives": {
        "type": "Digitized materials", 
        "examples": ["Letters", "Newspapers", "Government records"],
        "ethical_concerns": ["Copyright", "Representation", "Missing voices"],
        "access_method": "Digital libraries"
    },
    "interviews": {
        "type": "Collected testimony",
        "examples": ["Oral histories", "Surveys", "Focus groups"],
        "ethical_concerns": ["Informed consent", "Anonymity", "Power dynamics"],
        "access_method": "Direct collection"
    }
}

# Analyze each source type
print("CULTURAL DATA SOURCE ANALYSIS")
print("=" * 40)

for source_name in data_sources:
    details = data_sources[source_name]
    print()
    print(source_name.replace("_", " ").title() + ":")
    print("  Type: " + details["type"])
    print("  Examples: " + str(details["examples"]))
    print("  Access: " + details["access_method"])
    print("  Key ethical concerns: " + str(details["ethical_concerns"]))

# Count ethical concerns
all_concerns = []
for source_name in data_sources:
    details = data_sources[source_name]
    for concern in details["ethical_concerns"]:
        if concern not in all_concerns:
            all_concerns.append(concern)

print()
print("Total unique ethical concerns identified: " + str(len(all_concerns)))
print("All concerns: " + str(all_concerns))
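
The counting step can also be sketched with a dictionary that maps each concern to the sources that raise it, which shows whether any concern is shared across source types. This is a minimal sketch, assuming only the `ethical_concerns` lists matter here; the name `concern_to_sources` is introduced for illustration and is not part of the exercise.

```python
# Same structure as the exercise above, trimmed to the one field used here
data_sources = {
    "social_media": {"ethical_concerns": ["Privacy", "Consent", "Context collapse"]},
    "historical_archives": {"ethical_concerns": ["Copyright", "Representation", "Missing voices"]},
    "interviews": {"ethical_concerns": ["Informed consent", "Anonymity", "Power dynamics"]},
}

# Map each concern to the list of sources that mention it
concern_to_sources = {}
for source_name, details in data_sources.items():
    for concern in details["ethical_concerns"]:
        concern_to_sources.setdefault(concern, []).append(source_name)

# The dictionary keys are the unique concerns
unique_concerns = set(concern_to_sources)

print("Unique concerns: " + str(len(unique_concerns)))
for concern in sorted(concern_to_sources):
    print("  " + concern + ": " + ", ".join(concern_to_sources[concern]))
```

Because the comparison is an exact string match, "Consent" and "Informed consent" are counted as two separate concerns, even though a human reader might group them.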

## Exercise 2: Evaluating Data Collection Methods
Practice assessing the ethics of different collection approaches.

In [None]:
# Function to evaluate data collection ethics (simplified)
def evaluate_collection_method(method_name, description, consent_level, privacy_impact, potential_harm):
    """Evaluate the ethical implications of a data collection method""" 
    
    # Calculate ethics score (higher = more ethical).
    # Each factor scores 3, 2, or 0; any value not listed below scores 0.
    consent_score = 0
    if consent_level == "explicit":
        consent_score = 3
    elif consent_level == "implied":
        consent_score = 2
    else:
        consent_score = 0
    
    privacy_score = 0
    if privacy_impact == "low":
        privacy_score = 3
    elif privacy_impact == "medium":
        privacy_score = 2
    else:
        privacy_score = 0
    
    harm_score = 0
    if potential_harm == "minimal":
        harm_score = 3
    elif potential_harm == "moderate":
        harm_score = 2
    else:
        harm_score = 0
    
    total_score = consent_score + privacy_score + harm_score
    max_score = 9
    
    # Determine ethics rating
    if total_score >= 8:
        rating = "Highly Ethical"
    elif total_score >= 6:
        rating = "Moderately Ethical"
    elif total_score >= 4:
        rating = "Ethically Questionable"
    else:
        rating = "Ethically Problematic"
    
    print("Method: " + method_name)
    print("Description: " + description)
    print("Consent Level: " + consent_level)
    print("Privacy Impact: " + privacy_impact)
    print("Potential Harm: " + potential_harm)
    print("Ethics Score: " + str(total_score) + "/" + str(max_score))
    print("Rating: " + rating)
    print("-" * 30)
    
    return {"method": method_name, "score": total_score, "rating": rating}

# Evaluate different collection methods
print("ETHICAL EVALUATION OF COLLECTION METHODS")
print()

# Method 1: Voluntary Survey
result1 = evaluate_collection_method(
    "Voluntary Survey",
    "Participants voluntarily complete a survey about their cultural practices",
    "explicit",
    "low", 
    "minimal"
)

# Method 2: Public Social Media Scraping
result2 = evaluate_collection_method(
    "Public Social Media Scraping",
    "Collecting public posts from social media without notification",
    "none",
    "medium",
    "moderate"
)

# Method 3: Historical Archive Digitization
result3 = evaluate_collection_method(
    "Historical Archive Digitization", 
    "Digitizing letters and documents from historical archives",
    "implied",
    "low",
    "minimal"
)

# Summary of results
results = [result1, result2, result3]
print("SUMMARY OF ETHICAL EVALUATIONS:")
for result in results:
    print(result["method"] + ": " + result["rating"] + " (Score: " + str(result["score"]) + "/9)")
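
The scoring above gives 0 points to any value it does not recognize, so a typo such as "Explicit " would silently lower a score. One possible alternative, sketched here as an assumption rather than the course's method, is to keep the scores in lookup tables and raise an error for unknown values. The helper names `lookup_score` and `ethics_score` are hypothetical, as are the labels "high" and "severe" for the zero-point cases (the exercise only names "none").

```python
# Score tables mirroring the if/elif branches in the exercise above.
# "high" and "severe" are assumed labels for the zero-point cases.
CONSENT_SCORES = {"explicit": 3, "implied": 2, "none": 0}
PRIVACY_SCORES = {"low": 3, "medium": 2, "high": 0}
HARM_SCORES = {"minimal": 3, "moderate": 2, "severe": 0}


def lookup_score(table, value, factor_name):
    """Return the score for value, or raise ValueError if it is not in table."""
    key = value.strip().lower()  # tolerate stray spaces and capital letters
    if key not in table:
        raise ValueError("Unknown " + factor_name + " value: " + repr(value))
    return table[key]


def ethics_score(consent_level, privacy_impact, potential_harm):
    """Add up the three factor scores (maximum 9)."""
    return (
        lookup_score(CONSENT_SCORES, consent_level, "consent")
        + lookup_score(PRIVACY_SCORES, privacy_impact, "privacy")
        + lookup_score(HARM_SCORES, potential_harm, "harm")
    )


print(ethics_score("explicit", "low", "minimal"))   # Voluntary Survey
print(ethics_score("none", "medium", "moderate"))   # Public Social Media Scraping
```

Raising an error is a design choice: it stops the analysis early instead of producing a score that looks valid but rests on a mistyped input.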

## Exercise 3: Consent and Privacy Analysis
Examine consent models and privacy considerations.

In [None]:
# Different models of consent (simplified)
consent_models = {
    "opt_in": {
        "description": "Users must actively choose to participate",
        "advantages": ["Clear consent", "Informed participation", "Higher ethical standard"],
        "disadvantages": ["Lower participation rates", "Selection bias"],
        "best_for": ["Sensitive topics", "Vulnerable populations"]
    },
    "opt_out": {
        "description": "Users are included by default but can choose to leave",
        "advantages": ["Higher participation", "More representative samples"],
        "disadvantages": ["Questionable consent", "Ethical concerns"],
        "best_for": ["Low-risk research", "Public data analysis"]
    },
    "public_domain": {
        "description": "Data is already publicly available",
        "advantages": ["No consent needed", "Large datasets"],
        "disadvantages": ["Context collapse", "Privacy erosion"],
        "best_for": ["Historical analysis", "Public discourse studies"]
    }
}

# Function to analyze consent appropriateness (simplified)
def analyze_consent_model(research_context, data_sensitivity, population_vulnerability):
    """Recommend appropriate consent model based on research parameters""" 
    
    # Simple decision logic
    if data_sensitivity == "high" or population_vulnerability == "high":
        recommendation = "opt_in"
        reason = "High sensitivity or vulnerable population requires explicit consent"
    elif research_context == "historical" and data_sensitivity == "low":
        recommendation = "public_domain"
        reason = "Historical public data with low sensitivity"
    else:
        recommendation = "opt_out"
        reason = "No high-risk factors identified, so an opt-out model may be acceptable"
    
    return recommendation, reason

# Test different research scenarios
print("CONSENT MODEL RECOMMENDATIONS")
print("=" * 40)

# Scenario 1
print()
print("Scenario: Analyzing public tweets about movies")
print("Context: social_media | Sensitivity: medium | Vulnerability: low")
recommendation, reason = analyze_consent_model("social_media", "medium", "low")
print("Recommended Model: " + recommendation.replace("_", "-").title())
print("Reason: " + reason)
model_info = consent_models[recommendation]
print("Model Description: " + model_info["description"])

# Scenario 2
print()
print("Scenario: Interviewing trauma survivors")
print("Context: interviews | Sensitivity: high | Vulnerability: high")
recommendation, reason = analyze_consent_model("interviews", "high", "high")
print("Recommended Model: " + recommendation.replace("_", "-").title())
print("Reason: " + reason)
model_info = consent_models[recommendation]
print("Model Description: " + model_info["description"])

# Scenario 3
print()
print("Scenario: Analyzing 19th century newspapers")
print("Context: historical | Sensitivity: low | Vulnerability: low")
recommendation, reason = analyze_consent_model("historical", "low", "low")
print("Recommended Model: " + recommendation.replace("_", "-").title())
print("Reason: " + reason)
model_info = consent_models[recommendation]
print("Model Description: " + model_info["description"])
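
The three scenarios repeat the same steps, so they could also be stored in a list and processed in a loop. The sketch below assumes a compact copy of the decision logic under the hypothetical name `recommend_consent_model`; it returns only the model name, not the reason.

```python
def recommend_consent_model(research_context, data_sensitivity, population_vulnerability):
    """Compact copy of the decision logic from the exercise above."""
    if data_sensitivity == "high" or population_vulnerability == "high":
        return "opt_in"
    if research_context == "historical" and data_sensitivity == "low":
        return "public_domain"
    return "opt_out"


# Each scenario: (title, context, sensitivity, vulnerability)
scenarios = [
    ("Analyzing public tweets about movies", "social_media", "medium", "low"),
    ("Interviewing trauma survivors", "interviews", "high", "high"),
    ("Analyzing 19th century newspapers", "historical", "low", "low"),
]

recommendations = {}
for title, context, sensitivity, vulnerability in scenarios:
    model = recommend_consent_model(context, sensitivity, vulnerability)
    recommendations[title] = model
    print(title + " -> " + model)
```

The order of the checks matters: the sensitivity and vulnerability test runs first, so a historical dataset with high sensitivity is still assigned `opt_in` rather than `public_domain`.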

## Exercise 4: Bias and Representation
Identify potential biases in cultural datasets.

In [None]:
# Function to analyze dataset representation (simplified)
def analyze_dataset_bias(dataset_name, collection_method, source_demographics, missing_groups):
    """Analyze potential biases in cultural datasets""" 
    
    # Identify bias types using simple, case-insensitive keyword checks
    bias_types = []
    
    if "online" in collection_method.lower():
        bias_types.append("Digital divide bias")
    
    if "english" in source_demographics.lower():
        bias_types.append("Language bias")
    
    if "urban" in source_demographics.lower():
        bias_types.append("Geographic bias")
    
    if "college" in source_demographics.lower() or "university" in source_demographics.lower():
        bias_types.append("Educational bias")
    
    if missing_groups:
        bias_types.append("Systematic exclusion bias")
    
    # Calculate bias risk
    if len(bias_types) <= 1:
        risk_level = "Low"
    elif len(bias_types) <= 3:
        risk_level = "Medium"
    else:
        risk_level = "High"
    
    print("Dataset: " + dataset_name)
    print("Collection Method: " + collection_method)
    print("Source Demographics: " + source_demographics)
    if missing_groups:
        print("Missing Groups: " + missing_groups)
    else:
        print("Missing Groups: None identified")
    print("Identified Bias Types: " + str(bias_types))
    print("Bias Risk Level: " + risk_level)
    print("-" * 40)
    
    return {"dataset": dataset_name, "bias_count": len(bias_types), "risk": risk_level}

# Sample datasets to analyze
print("DATASET BIAS ANALYSIS")
print("=" * 40)

# Dataset 1: Twitter Literature Discussions
result1 = analyze_dataset_bias(
    "Twitter Literature Discussions",
    "Online social media scraping",
    "Primarily English-speaking, urban, college-educated users",
    "Rural communities, non-English speakers, older adults"
)

# Dataset 2: Historical Newspaper Archive
result2 = analyze_dataset_bias(
    "Historical Newspaper Archive",
    "Digital archive access",
    "Major city newspapers, English language, 1900-2000",
    "Community papers, minority-owned publications, non-English press"
)

# Dataset 3: Community Survey
result3 = analyze_dataset_bias(
    "Community Survey on Cultural Practices",
    "Door-to-door interviews in multiple languages",
    "Representative sample across age, income, ethnicity, geography",
    ""
)

# Summary statistics 
results = [result1, result2, result3]
high_risk = 0
medium_risk = 0
low_risk = 0

for r in results:
    if r["risk"] == "High":
        high_risk = high_risk + 1
    elif r["risk"] == "Medium":
        medium_risk = medium_risk + 1
    else:
        low_risk = low_risk + 1

print()
print("BIAS RISK SUMMARY:")
print("High Risk: " + str(high_risk) + " dataset(s)")
print("Medium Risk: " + str(medium_risk) + " dataset(s)")
print("Low Risk: " + str(low_risk) + " dataset(s)")
print()
print("Recommendation: Focus mitigation efforts on " + str(high_risk + medium_risk) + " datasets with elevated bias risk.")
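
Keyword checks like the ones above depend heavily on wording. "Major city newspapers" does not contain the word "urban", so the Historical Newspaper Archive is not flagged for geographic bias. The sketch below is one possible refinement, not the course's method: it uses a hypothetical `BIAS_KEYWORDS` table with several keywords per bias type and matches whole words only, so that "city" does not match inside "ethnicity".

```python
import re

# Hypothetical keyword lists; extend them for your own datasets
BIAS_KEYWORDS = {
    "Digital divide bias": ["online", "social media"],
    "Language bias": ["english"],
    "Geographic bias": ["urban", "city", "metropolitan"],
    "Educational bias": ["college", "university"],
}


def detect_biases(text):
    """Return the bias types whose keywords appear in text as whole words."""
    lowered = text.lower()
    found = []
    for bias_type, keywords in BIAS_KEYWORDS.items():
        for keyword in keywords:
            # \b marks a word boundary, so "city" will not match "ethnicity"
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                found.append(bias_type)
                break  # one match is enough for this bias type
    return found


print(detect_biases("Major city newspapers, English language, 1900-2000"))
print(detect_biases("Representative sample across age, income, ethnicity, geography"))
```

Even with longer keyword lists, this remains a rough screening step. A description that avoids every listed word will still pass unflagged, so the result should prompt a closer reading rather than replace one.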

## Exercise 5: Ethical Decision Framework
Practice making ethical decisions about data use.

In [None]:
# Ethical decision-making framework (simplified)
def ethical_decision_framework(research_question, data_source, potential_benefits, potential_harms, alternatives):
    """Guide ethical decision-making about data use""" 
    
    print("ETHICAL DECISION FRAMEWORK")
    print("Research Question: " + research_question)
    print("Proposed Data Source: " + data_source)
    print()
    
    # Step 1: Benefits analysis
    print("STEP 1: Benefits Analysis")
    for i in range(len(potential_benefits)):
        print("  " + str(i+1) + ". " + potential_benefits[i])
    
    # Step 2: Harm assessment
    print()
    print("STEP 2: Potential Harms")
    for i in range(len(potential_harms)):
        print("  " + str(i+1) + ". " + potential_harms[i])
    
    # Step 3: Alternative approaches
    print()
    print("STEP 3: Alternative Approaches")
    for i in range(len(alternatives)):
        print("  " + str(i+1) + ". " + alternatives[i])
    
    # Step 4: Decision guidance
    print()
    print("STEP 4: Decision Guidance Questions")
    questions = [
        "Do the benefits clearly outweigh the harms?",
        "Have you minimized potential harms through design choices?",
        "Are there less harmful alternatives that could answer your question?",
        "Would the people whose data you're using consent if they knew?",
        "Does your research serve the interests of the communities studied?"
    ]
    
    for i in range(len(questions)):
        print("  " + str(i+1) + ". " + questions[i])
    
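    # Simple recommendation logic: a rough heuristic based only on how many
    # harms and alternatives are listed, not on how serious each harm is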
    harm_count = len(potential_harms)
    alt_count = len(alternatives)
    
    if harm_count <= 1 and alt_count >= 2:
        recommendation = "Consider alternatives first"
    elif harm_count >= 3:
        recommendation = "High risk - requires strong justification"
    else:
        recommendation = "Proceed with careful safeguards"
    
    print()
    print("INITIAL RECOMMENDATION: " + recommendation)
    print()
    print("Next Steps: Consult with IRB, advisors, and community stakeholders.")
    
    return recommendation

# Test case: Social media research
recommendation = ethical_decision_framework(
    "How do young people discuss mental health on social media?",
    "Public Twitter posts containing mental health keywords",
    [
        "Better understanding of youth mental health discourse",
        "Inform mental health support programs",
        "Identify patterns that could help early intervention"
    ],
    [
        "Privacy violation for vulnerable individuals",
        "Risk of re-identification despite public posts", 
        "Potential stigmatization of communities",
        "Taking posts out of original context"
    ],
    [
        "Partner with mental health organizations for voluntary participation",
        "Use synthetic data based on patterns rather than actual posts",
        "Focus on aggregate trends rather than individual posts",
        "Conduct interviews with explicit informed consent"
    ]
)
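
One of the alternatives listed above is to focus on aggregate trends rather than individual posts. A minimal sketch of that idea follows, using invented example records. The keyword, the `MIN_GROUP_SIZE` threshold, and all names are assumptions made for illustration. The code keeps only monthly counts and hides counts below the threshold.

```python
from collections import Counter

# Invented example records, for illustration only
posts = [
    {"user": "user_a", "month": "2024-01", "text": "Feeling anxious about exams"},
    {"user": "user_b", "month": "2024-01", "text": "Anxious again today"},
    {"user": "user_d", "month": "2024-01", "text": "So anxious before my interview"},
    {"user": "user_c", "month": "2024-02", "text": "Therapy helped my stress"},
    {"user": "user_a", "month": "2024-02", "text": "Less anxious this week"},
]

KEYWORD = "anxious"    # hypothetical keyword of interest
MIN_GROUP_SIZE = 2     # hypothetical threshold: hide counts smaller than this

# Count posts mentioning the keyword per month; user names and text are not kept
monthly_counts = Counter()
for post in posts:
    if KEYWORD in post["text"].lower():
        monthly_counts[post["month"]] += 1

# Suppress small groups, which are easier to trace back to individuals
report = {}
for month, count in sorted(monthly_counts.items()):
    if count >= MIN_GROUP_SIZE:
        report[month] = count
    else:
        report[month] = "suppressed"

print(report)
```

Aggregation lowers the risk of re-identification but does not remove it, so it works alongside the consultation steps named in the framework rather than replacing them.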

## Exercise 6: Your Research Ethics Plan
Develop an ethics plan for your own research interests.

In [None]:
# TODO: Define your research area and data needs
your_research = {
    "field": "Your field of study (e.g., literature, history, art, etc.)",
    "question": "Your specific research question",
    "data_needed": "What kind of data would help answer your question?",
    "population": "Who/what would you be studying?",
    "timeframe": "Historical period or contemporary?"
}

# TODO: Identify potential data sources for your research
potential_sources = [
    # Add your potential data sources here
    # Examples: "Digital archives", "Social media posts", "Interviews", etc.
]

# TODO: Consider ethical implications
ethical_considerations = {
    "consent_challenges": [],  # What makes consent difficult in your field?
    "privacy_risks": [],       # What privacy risks exist?
    "representation_gaps": [], # Who might be excluded from your data?
    "potential_harms": [],     # How could your research cause harm?
    "community_benefits": []   # How does your research serve the communities studied?
}

# TODO: Develop mitigation strategies
mitigation_strategies = [
    # Add your strategies for addressing ethical concerns
    # Examples: "Partner with community organizations", "Use anonymization", etc.
]

# Function to display your ethics plan
def display_ethics_plan(research_info, sources, considerations, mitigations):
    print("YOUR RESEARCH ETHICS PLAN")
    print("=" * 40)
    
    print("RESEARCH OVERVIEW:")
    for key in research_info:
        value = research_info[key]
        print("  " + key.replace("_", " ").title() + ": " + value)
    
    print()
    print("POTENTIAL DATA SOURCES:")
    for i in range(len(sources)):
        print("  " + str(i+1) + ". " + sources[i])
    
    print()
    print("ETHICAL CONSIDERATIONS:")
    for category in considerations:
        items = considerations[category]
        if items:  # Only show categories that have items
            print("  " + category.replace("_", " ").title() + ":")
            for item in items:
                print("    - " + item)
    
    print()
    print("MITIGATION STRATEGIES:")
    for i in range(len(mitigations)):
        print("  " + str(i+1) + ". " + mitigations[i])
    
    print()
    print("NEXT STEPS:")
    print("  1. Consult with faculty advisor about ethical considerations")
    print("  2. Research IRB requirements for your institution")
    print("  3. Identify relevant community stakeholders to consult")
    print("  4. Develop detailed data management and privacy protocols")

# Display your plan (will show placeholder text until you customize it)
display_ethics_plan(your_research, potential_sources, ethical_considerations, mitigation_strategies)

print()
print("=" * 40)
print("REFLECTION: Customize the variables above with your specific research interests and ethical considerations.")
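
Before sharing a plan, it can help to check which sections are still empty. The sketch below assumes the same list and dictionary shapes as the variables above. The function name `find_unfilled_sections` and the example entries are hypothetical.

```python
def find_unfilled_sections(sources, considerations, mitigations):
    """Return the names of plan sections that are still empty."""
    unfilled = []
    if not sources:
        unfilled.append("potential_sources")
    for category, items in considerations.items():
        if not items:
            unfilled.append(category)
    if not mitigations:
        unfilled.append("mitigation_strategies")
    return unfilled


# Hypothetical, partially completed plan
example_sources = ["Digital archives"]
example_considerations = {
    "consent_challenges": ["Authors of historical letters cannot consent"],
    "privacy_risks": [],
    "representation_gaps": ["Letters from non-literate communities were rarely preserved"],
    "potential_harms": [],
    "community_benefits": [],
}
example_mitigations = []

missing = find_unfilled_sections(example_sources, example_considerations, example_mitigations)
print("Sections still to complete: " + ", ".join(missing))
```

An empty result only means every section has at least one entry. It says nothing about whether those entries are thorough.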

## Summary

You explored:
- Different types of cultural data sources and their ethical implications
- Methods for evaluating data collection approaches
- Consent models and privacy considerations
- Identifying and addressing bias in datasets
- Frameworks for ethical decision-making
- Developing ethics plans for your own research

**Key Principles:**
- Prioritize consent and transparency
- Consider potential harms and benefits
- Address representation and bias
- Serve the communities you study
- Consult with stakeholders and ethics boards

**Next:** Review 06 will cover Pandas for data analysis.

 