# Module 04: Survey Design & Measurement

**Estimated Time**: 50 minutes

## Learning Objectives

By the end of this module, you will be able to:

1. **Design** effective survey questions that minimize bias and maximize clarity
2. **Select** appropriate response scales for different question types
3. **Evaluate** psychometric properties: reliability (consistency) and validity (accuracy)
4. **Calculate** Cronbach's alpha and other reliability metrics
5. **Identify** and mitigate common survey biases (acquiescence, social desirability, etc.)
6. **Distinguish** between measurement scales (nominal, ordinal, interval, ratio)
7. **Conduct** item analysis to refine survey instruments
8. **Implement** survey pretesting and cognitive interviewing

## Why This Matters

**Bad measurement = bad data = bad science**

Even the most sophisticated analysis cannot overcome poor measurement:
- Ambiguous questions lead to meaningless responses
- Biased questions distort reality
- Unreliable measures add noise and reduce power
- Invalid measures answer the wrong question

This module teaches you to create measurement instruments that are:
- **Reliable**: Consistent across time and contexts
- **Valid**: Actually measure what they claim to measure
- **Unbiased**: Minimize systematic distortions
- **Actionable**: Produce data you can trust

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr, spearmanr
import warnings

warnings.filterwarnings("ignore")

# Set style
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")

# Set random seed
np.random.seed(42)

# Create output directory
import os

os.makedirs("outputs/module_04", exist_ok=True)

print("✓ Libraries imported successfully")
print("✓ Output directory created")

## 1. Question Design Principles

### The Seven Deadly Sins of Survey Questions

#### 1. Double-Barreled Questions
**Bad**: "Do you think the government should increase taxes and improve healthcare?"
- Problem: Two questions in one. What if someone supports one but not the other?

**Good**: Split into two questions:
- "Should the government increase taxes?"
- "Should the government improve healthcare?"

#### 2. Leading Questions
**Bad**: "Don't you agree that climate change is the most important issue facing humanity?"
- Problem: Guides respondent toward a particular answer

**Good**: "How important is climate change relative to other issues?"

#### 3. Loaded Questions
**Bad**: "Should we allow dangerous criminals to walk free?"
- Problem: Emotional language biases response

**Good**: "Should non-violent offenders be eligible for early release?"

#### 4. Ambiguous Questions
**Bad**: "How often do you exercise?"
- Problem: What counts as exercise? Walking? Housework? Sports?

**Good**: "In a typical week, on how many days do you engage in at least 30 minutes of moderate-to-vigorous physical activity (e.g., brisk walking, jogging, cycling, swimming)?"

#### 5. Negatively Worded Questions
**Bad**: "To what extent do you disagree that you are not dissatisfied with our service?"
- Problem: Double/triple negatives confuse respondents

**Good**: "How satisfied are you with our service?"

#### 6. Questions Assuming Knowledge
**Bad**: "What is your opinion on the proposed amendment to Section 1031 of the Internal Revenue Code?"
- Problem: Most respondents won't know what this is

**Good**: Include a brief explanation of the amendment, or add a "Don't know" option

#### 7. Questions Beyond Recall Ability
**Bad**: "How many times did you sneeze in the past year?"
- Problem: No one can accurately remember this

**Good**: Shorten timeframe: "How many times did you sneeze today?"
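Some of these problems leave surface traces that a rough automated check can catch before human review. The sketch below is only an illustration: the `RULES` patterns and the `flag_question` helper are hypothetical heuristics invented here, not a validated screening tool, and they will miss many problems (and flag some harmless questions).

```python
import re

# Hypothetical heuristics: each regex hints at one problem type from the list above.
RULES = {
    "double-barreled": re.compile(r"\b(and|or)\b", re.IGNORECASE),
    "leading": re.compile(r"\b(don't you|wouldn't you|isn't it)\b", re.IGNORECASE),
    "negation": re.compile(r"\b(not|never|no|disagree|dissatisfied)\b", re.IGNORECASE),
}


def flag_question(text):
    """Return the sorted names of the heuristic rules that a question triggers."""
    flags = []
    for name, pattern in RULES.items():
        hits = pattern.findall(text)
        # One negation is usually fine; two or more suggests a double negative
        threshold = 2 if name == "negation" else 1
        if len(hits) >= threshold:
            flags.append(name)
    return sorted(flags)


questions = [
    "Do you think the government should increase taxes and improve healthcare?",
    "Don't you agree that climate change is the most important issue facing humanity?",
    "To what extent do you disagree that you are not dissatisfied with our service?",
    "How satisfied are you with our service?",
]
for q in questions:
    print(flag_question(q), "|", q)
```

A flag is a prompt for a closer look, not a verdict: "and" also appears in perfectly good questions, so every flagged item still needs human judgment.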

In [None]:
# Demonstrate impact of question wording on responses

# Simulate responses to differently worded questions on same topic
np.random.seed(123)
n_respondents = 200

# Scenario: Support for environmental policy
# True underlying support: 60% (what we'd get with neutral wording)
true_support_rate = 0.60

# Neutral wording: "Should the government regulate carbon emissions?"
neutral_responses = np.random.binomial(1, true_support_rate, n_respondents)

# Leading/positive wording: "Should the government protect our environment by regulating carbon emissions?"
leading_positive = np.random.binomial(
    1, true_support_rate + 0.20, n_respondents
)  # Inflates support

# Leading/negative wording: "Should the government impose costly regulations on carbon emissions?"
leading_negative = np.random.binomial(
    1, true_support_rate - 0.25, n_respondents
)  # Deflates support

# Ambiguous wording: "Should the government do something about pollution?"
ambiguous = np.random.binomial(
    1, true_support_rate + 0.15, n_respondents
)  # Vague → higher agreement

# Create summary
wording_comparison = pd.DataFrame(
    {
        "Question Wording": ["Neutral", "Leading (Positive)", "Leading (Negative)", "Ambiguous"],
        "Support Rate (%)": [
            neutral_responses.mean() * 100,
            leading_positive.mean() * 100,
            leading_negative.mean() * 100,
            ambiguous.mean() * 100,
        ],
    }
)

print("Impact of Question Wording on Response Rates:\n")
print(wording_comparison.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))

colors = ["#06A77D", "#F4A261", "#E63946", "#457B9D"]
bars = ax.barh(
    wording_comparison["Question Wording"],
    wording_comparison["Support Rate (%)"],
    color=colors,
    edgecolor="black",
    linewidth=1.5,
)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, wording_comparison["Support Rate (%)"])):
    ax.text(val + 1, i, f"{val:.1f}%", va="center", fontweight="bold")

# Mark true rate
ax.axvline(
    x=true_support_rate * 100,
    color="black",
    linestyle="--",
    linewidth=2,
    label=f"True Support ({true_support_rate*100:.0f}%)",
)

ax.set_xlabel("Support Rate (%)", fontsize=12, fontweight="bold")
ax.set_title("How Question Wording Affects Survey Responses", fontsize=14, fontweight="bold")
ax.set_xlim([0, 100])
ax.legend()
ax.grid(True, alpha=0.3, axis="x")

plt.tight_layout()
plt.savefig("outputs/module_04/question_wording_bias.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 The same underlying attitude produces wildly different results")
print("   depending on how the question is worded!")

## 2. Response Scales and Formats

### Common Scale Types

#### Likert Scales
**Usage**: Measuring agreement, frequency, importance, satisfaction

**Example (5-point agreement)**:
1. Strongly Disagree
2. Disagree
3. Neither Agree nor Disagree
4. Agree
5. Strongly Agree

**Design Choices**:
- **Number of points**: 5-7 is standard (more points ≠ more precision)
- **Middle option**: Include neutral option? (Yes: allows true neutrality; No: forces choice)
- **Labels**: Label all points vs. only endpoints

#### Semantic Differential Scales
**Usage**: Measuring attitudes along bipolar dimensions

**Example**:
```
Please rate this product:
Inexpensive  1  2  3  4  5  6  7  Expensive
Low Quality  1  2  3  4  5  6  7  High Quality
Unattractive 1  2  3  4  5  6  7  Attractive
```

#### Visual Analog Scales (VAS)
**Usage**: Continuous measurement (pain, satisfaction, emotion)

**Example**:
```
How much pain are you experiencing?
|------------------------------------------------|
No pain                                  Worst pain imaginable
```

The respondent marks a point on the line; the researcher measures its distance from the left end.

#### Rating Scales
**Usage**: Evaluating specific attributes

**Example (0-10 scale)**:
"On a scale from 0 to 10, where 0 is 'not at all likely' and 10 is 'extremely likely', how likely are you to recommend our product?"

### Scale Selection Guidelines

| Construct | Recommended Scale | Rationale |
|-----------|------------------|----------|
| Agreement | 5-point Likert | Standard, well-understood |
| Frequency | 5-point Likert | Ordinal categories (Never to Always) |
| Satisfaction | 5 or 7-point Likert | Allows nuance |
| Pain/Discomfort | VAS or 0-10 | Continuous, sensitive |
| Binary choice | Yes/No | When middle ground doesn't exist |
| Ranking | Drag-and-drop or numbered | When priorities matter |

In [None]:
# Compare different scale formats on same construct

np.random.seed(456)
n = 150

# True underlying satisfaction (0-100 scale)
true_satisfaction = np.random.beta(2, 2, n) * 100  # Beta distribution for realistic spread

# Convert to different scale formats

# 5-point Likert (1-5)
likert_5 = np.digitize(true_satisfaction, bins=[0, 20, 40, 60, 80, 100])

# 7-point Likert (1-7)
likert_7 = np.digitize(true_satisfaction, bins=[0, 14.3, 28.6, 42.9, 57.2, 71.5, 85.8, 100])

# Binary (0-1)
binary = (true_satisfaction >= 50).astype(int)

# 0-10 rating
rating_10 = np.round(true_satisfaction / 10).astype(int)
rating_10 = np.clip(rating_10, 0, 10)

# Create dataframe
df_scales = pd.DataFrame(
    {
        "True_Satisfaction": true_satisfaction,
        "Likert_5": likert_5,
        "Likert_7": likert_7,
        "Binary": binary,
        "Rating_10": rating_10,
    }
)

# Calculate information loss (correlation with true score)
correlations = {
    "5-Point Likert": pearsonr(df_scales["True_Satisfaction"], df_scales["Likert_5"])[0],
    "7-Point Likert": pearsonr(df_scales["True_Satisfaction"], df_scales["Likert_7"])[0],
    "Binary (Yes/No)": pearsonr(df_scales["True_Satisfaction"], df_scales["Binary"])[0],
    "0-10 Rating": pearsonr(df_scales["True_Satisfaction"], df_scales["Rating_10"])[0],
}

print("Correlation with True Satisfaction (higher = less information loss):\n")
for scale, corr in sorted(correlations.items(), key=lambda x: x[1], reverse=True):
    print(f"{scale:20s}: r = {corr:.3f}")

# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

scale_names = ["Likert_5", "Likert_7", "Binary", "Rating_10"]
scale_labels = ["5-Point Likert", "7-Point Likert", "Binary (Yes/No)", "0-10 Rating"]
colors_palette = ["#E63946", "#F4A261", "#06A77D", "#457B9D"]

for i, (scale_name, label, color) in enumerate(zip(scale_names, scale_labels, colors_palette)):
    # Count frequencies
    value_counts = df_scales[scale_name].value_counts().sort_index()

    axes[i].bar(
        value_counts.index,
        value_counts.values,
        color=color,
        alpha=0.7,
        edgecolor="black",
        linewidth=1.5,
    )

    axes[i].set_xlabel("Response Value", fontsize=11, fontweight="bold")
    axes[i].set_ylabel("Frequency", fontsize=11, fontweight="bold")
    axes[i].set_title(
        f"{label}\n(r = {correlations[label]:.3f} with true score)", fontsize=12, fontweight="bold"
    )
    axes[i].grid(True, alpha=0.3, axis="y")

plt.suptitle("Response Distributions Across Scale Formats", fontsize=15, fontweight="bold", y=1.00)
plt.tight_layout()
plt.savefig("outputs/module_04/scale_comparison.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 More response options generally preserve more information,")
print("   but gains diminish beyond 7-10 points.")

## 3. Psychometric Properties: Reliability

**Reliability** = Consistency of measurement

A reliable measure produces similar results when:
- The same person completes it multiple times (test-retest reliability)
- Different items measure the same construct (internal consistency)
- Different raters evaluate the same thing (inter-rater reliability)

### Types of Reliability

#### 1. Test-Retest Reliability
**Method**: Administer same survey to same people at two time points
**Metric**: Correlation between Time 1 and Time 2 scores
**Interpretation**: r > 0.70 is acceptable
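As a minimal sketch of the calculation, the simulation below gives each person a stable true score and adds independent noise at each administration. The sample size, noise levels, and variable names are illustrative assumptions, not values from a real study.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n = 120
true_score = rng.normal(50, 10, n)        # stable trait level per person (assumed)
time1 = true_score + rng.normal(0, 4, n)  # first administration
time2 = true_score + rng.normal(0, 4, n)  # second administration, same people

# Test-retest reliability: Pearson correlation between the two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest r = {r_test_retest:.3f}")
print("Acceptable" if r_test_retest > 0.70 else "Below the 0.70 guideline")
```

In real data the retest interval matters: too short and respondents remember their answers, too long and the trait itself may have changed.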

#### 2. Internal Consistency Reliability
**Method**: Examine how well items measuring the same construct correlate
**Metrics**:
- **Cronbach's Alpha (α)**: Most common
- **Split-half reliability**: Correlation between two halves of scale

**Cronbach's Alpha formula**:

$$\alpha = \frac{k}{k-1} \left(1 - \frac{\sum_{i=1}^{k} \sigma_{i}^{2}}{\sigma_{\text{total}}^{2}}\right)$$

Where:
- $k$ = number of items
- $\sigma_{i}^{2}$ = variance of item $i$
- $\sigma_{\text{total}}^{2}$ = variance of total scale scores

**Interpretation**:
- α < 0.60: Unacceptable
- α = 0.60-0.70: Questionable
- α = 0.70-0.80: Acceptable
- α = 0.80-0.90: Good
- α > 0.90: Excellent (but check for redundancy)
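Split-half reliability can be sketched in a few lines. The version below assumes an odd-even split and applies the Spearman-Brown correction, $r_{SB} = 2r / (1 + r)$, because each half is only half as long as the full scale. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np


def split_half_reliability(items):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    items = np.asarray(items, dtype=float)
    odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_halves = np.corrcoef(odd_half, even_half)[0, 1]
    # Spearman-Brown: step the half-length correlation up to full scale length
    return 2 * r_halves / (1 + r_halves)


# Simulated 8-item scale: every item = shared latent trait + item-specific noise
rng = np.random.default_rng(1)
latent = rng.normal(0, 1, 300)
items = latent[:, None] + rng.normal(0, 1, (300, 8))
print(f"Split-half reliability = {split_half_reliability(items):.3f}")
```

The result depends on how the items are split, which is one reason Cronbach's alpha is usually reported instead.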

#### 3. Inter-Rater Reliability
**Method**: Multiple raters evaluate same targets
**Metrics**:
- **Cohen's Kappa (κ)**: For categorical ratings
- **Intraclass Correlation (ICC)**: For continuous ratings
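For two raters, Cohen's kappa compares observed agreement $p_o$ with the agreement expected by chance $p_e$: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below implements that formula directly; the ratings are made-up illustrative data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same targets."""
    n = len(rater_a)
    # Observed agreement: share of targets given the same label by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    # Undefined (division by zero) if both raters use one identical label throughout
    return (p_o - p_e) / (1 - p_e)


rater_a = ["yes"] * 6 + ["no"] * 4
rater_b = ["yes"] * 5 + ["no"] * 4 + ["yes"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.3f}")  # kappa = 0.583
```

Here the raters agree on 8 of 10 targets ($p_o = 0.80$), but with both saying "yes" 60% of the time, chance alone would produce $p_e = 0.52$, so kappa is well below the raw agreement rate.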

In [None]:
# Calculate Cronbach's Alpha for a multi-item scale


def cronbach_alpha(data):
    """
    Calculate Cronbach's Alpha for internal consistency.

    Parameters:
    - data: DataFrame or 2D array where rows = respondents, columns = items

    Returns:
    - alpha: Cronbach's alpha coefficient
    """
    if isinstance(data, pd.DataFrame):
        data = data.values

    # Number of items
    k = data.shape[1]

    # Variance of each item
    item_variances = np.var(data, axis=0, ddof=1)

    # Variance of total scores (sum across items)
    total_scores = np.sum(data, axis=1)
    total_variance = np.var(total_scores, ddof=1)

    # Cronbach's alpha
    alpha = (k / (k - 1)) * (1 - np.sum(item_variances) / total_variance)

    return alpha


# Simulate survey data: Depression scale with 8 items
np.random.seed(789)
n_respondents = 250
n_items = 8

# Each person has true depression level (latent variable)
true_depression = np.random.normal(50, 15, n_respondents)

# Items are noisy measurements of true depression
# Good scale: Items highly correlated with true score
item_responses = np.zeros((n_respondents, n_items))

for i in range(n_items):
    # Each item = true score + item-specific noise
    item_responses[:, i] = true_depression + np.random.normal(0, 8, n_respondents)
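    # Convert to a 1-5 Likert scale: four cut points give five categories, so
    # extreme values land in 1 or 5 instead of the out-of-range codes 0 or 6
    item_responses[:, i] = np.digitize(item_responses[:, i], bins=[35, 45, 55, 65]) + 1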

# Create dataframe
item_names = [f"Item_{i+1}" for i in range(n_items)]
df_depression = pd.DataFrame(item_responses, columns=item_names)

print("Depression Scale Data (first 10 respondents):\n")
print(df_depression.head(10))

# Calculate Cronbach's alpha
alpha = cronbach_alpha(df_depression)

print("\n" + "=" * 60)
print("RELIABILITY ANALYSIS")
print("=" * 60)
print(f"\nCronbach's Alpha: α = {alpha:.3f}")

if alpha >= 0.90:
    interpretation = "Excellent (consider removing redundant items)"
elif alpha >= 0.80:
    interpretation = "Good"
elif alpha >= 0.70:
    interpretation = "Acceptable"
elif alpha >= 0.60:
    interpretation = "Questionable"
else:
    interpretation = "Unacceptable"

print(f"Interpretation: {interpretation}")
print(f"\nNumber of items: {n_items}")
print(f"Number of respondents: {n_respondents}")

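# Corrected item-total correlations: correlate each item with the total of the
# remaining items, so an item's own variance does not inflate its correlation
total_score = df_depression.sum(axis=1)
item_total_corr = [
    pearsonr(df_depression[item], total_score - df_depression[item])[0] for item in item_names
]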

print(f"\nItem-Total Correlations:")
for item, corr in zip(item_names, item_total_corr):
    print(f"  {item}: r = {corr:.3f}")

In [None]:
# Alpha if item deleted analysis
# Shows how alpha changes if each item is removed

alpha_if_deleted = []

for i in range(n_items):
    # Create dataset without item i
    items_subset = df_depression.drop(columns=[item_names[i]])
    alpha_deleted = cronbach_alpha(items_subset)
    alpha_if_deleted.append(alpha_deleted)

# Create summary table
item_analysis = pd.DataFrame(
    {
        "Item": item_names,
        "Item-Total Correlation": item_total_corr,
        "Alpha if Deleted": alpha_if_deleted,
        "Keep?": [
            "✓" if corr > 0.30 and alpha_del <= alpha else "✗ Consider removing"
            for corr, alpha_del in zip(item_total_corr, alpha_if_deleted)
        ],
    }
)

print("\n" + "=" * 60)
print("ITEM ANALYSIS")
print("=" * 60)
print(f"\nCurrent Alpha: {alpha:.3f}")
print("\nItem Performance:")
print(item_analysis.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Item-total correlations
colors = ["#06A77D" if corr > 0.30 else "#E63946" for corr in item_total_corr]
axes[0].barh(item_names, item_total_corr, color=colors, edgecolor="black", linewidth=1.5, alpha=0.7)
axes[0].axvline(
    x=0.30, color="black", linestyle="--", linewidth=2, label="Minimum threshold (0.30)"
)
axes[0].set_xlabel("Item-Total Correlation", fontsize=12, fontweight="bold")
axes[0].set_title("Item-Total Correlations", fontsize=13, fontweight="bold")
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis="x")

# Panel 2: Alpha if deleted
colors2 = ["#E63946" if a > alpha else "#06A77D" for a in alpha_if_deleted]
axes[1].barh(
    item_names, alpha_if_deleted, color=colors2, edgecolor="black", linewidth=1.5, alpha=0.7
)
axes[1].axvline(
    x=alpha, color="black", linestyle="--", linewidth=2, label=f"Current Alpha ({alpha:.3f})"
)
axes[1].set_xlabel("Cronbach's Alpha if Item Deleted", fontsize=12, fontweight="bold")
axes[1].set_title("Effect of Removing Each Item", fontsize=13, fontweight="bold")
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis="x")

plt.tight_layout()
plt.savefig("outputs/module_04/reliability_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 Items with low item-total correlations (< 0.30) should be revised or removed.")
print("   If removing an item increases alpha, that item may be measuring something different.")

## 4. Psychometric Properties: Validity

**Validity** = Accuracy of measurement

A valid measure actually measures what it claims to measure.

**Important**: Reliability is necessary but not sufficient for validity.
- A bathroom scale could consistently show you weigh 150 lbs (reliable)
- But if you actually weigh 180 lbs, it's not valid

### Types of Validity

#### 1. Face Validity
**Definition**: Does it appear to measure what it claims?
**Assessment**: Subjective expert judgment
**Example**: "I feel sad" has face validity for depression

#### 2. Content Validity
**Definition**: Does it cover the full domain of the construct?
**Assessment**: Expert review
**Example**: A math test should cover all relevant topics, not just geometry

#### 3. Criterion Validity
**Definition**: Does it correlate with an external criterion?

**Subtypes**:
- **Concurrent validity**: Correlates with criterion measured at same time
  - Example: New depression scale correlates with established BDI
- **Predictive validity**: Predicts future outcomes
  - Example: SAT scores predict college GPA

#### 4. Construct Validity
**Definition**: Does it behave as theory predicts?

**Subtypes**:
- **Convergent validity**: Correlates with measures of related constructs
  - Example: Depression scale correlates with anxiety scale
- **Discriminant validity**: Does NOT correlate with unrelated constructs
  - Example: Depression scale does NOT correlate with height

In [None]:
# Demonstrate construct validity: convergent and discriminant

np.random.seed(999)
n = 200

# Latent true constructs
true_depression = np.random.normal(50, 15, n)
true_anxiety = 0.6 * true_depression + np.random.normal(20, 10, n)  # Correlated with depression
true_height = np.random.normal(170, 10, n)  # Unrelated to depression

# Measured variables (with noise)
depression_scale_new = true_depression + np.random.normal(0, 8, n)  # Our new scale
depression_scale_old = true_depression + np.random.normal(
    0, 10, n
)  # Established scale (more noise)
anxiety_scale = true_anxiety + np.random.normal(0, 8, n)
height_measured = true_height + np.random.normal(0, 2, n)

# Create dataframe
df_validity = pd.DataFrame(
    {
        "Depression_New": depression_scale_new,
        "Depression_Established": depression_scale_old,
        "Anxiety": anxiety_scale,
        "Height_cm": height_measured,
    }
)

print("Construct Validity Analysis\n")
print("=" * 60)

# Criterion validity (concurrent): Correlate with established measure
criterion_corr, criterion_p = pearsonr(
    df_validity["Depression_New"], df_validity["Depression_Established"]
)
print(f"\n1. CRITERION VALIDITY (Concurrent)")
print(f"   Correlation with established depression scale:")
print(f"   r = {criterion_corr:.3f}, p = {criterion_p:.4f}")
if criterion_corr > 0.70:
    print(f"   ✓ Good criterion validity (r > 0.70)")
else:
    print(f"   ✗ Questionable criterion validity (r < 0.70)")

# Convergent validity: Should correlate with related construct
convergent_corr, convergent_p = pearsonr(df_validity["Depression_New"], df_validity["Anxiety"])
print(f"\n2. CONVERGENT VALIDITY")
print(f"   Correlation with anxiety (related construct):")
print(f"   r = {convergent_corr:.3f}, p = {convergent_p:.4f}")
if convergent_corr > 0.30 and convergent_p < 0.05:
    print(f"   ✓ Good convergent validity (moderate-strong correlation)")
else:
    print(f"   ✗ Poor convergent validity")

# Discriminant validity: Should NOT correlate with unrelated construct
discriminant_corr, discriminant_p = pearsonr(
    df_validity["Depression_New"], df_validity["Height_cm"]
)
print(f"\n3. DISCRIMINANT VALIDITY")
print(f"   Correlation with height (unrelated construct):")
print(f"   r = {discriminant_corr:.3f}, p = {discriminant_p:.4f}")
if abs(discriminant_corr) < 0.20 or discriminant_p > 0.05:
    print(f"   ✓ Good discriminant validity (weak/no correlation)")
else:
    print(f"   ✗ Poor discriminant validity (unexpected correlation)")

print("\n" + "=" * 60)

In [None]:
# Visualize validity evidence
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Panel 1: Criterion validity
axes[0].scatter(
    df_validity["Depression_Established"],
    df_validity["Depression_New"],
    alpha=0.5,
    s=50,
    color="#06A77D",
    edgecolors="black",
    linewidths=0.5,
)
# Add regression line
z = np.polyfit(df_validity["Depression_Established"], df_validity["Depression_New"], 1)
p = np.poly1d(z)
axes[0].plot(
    df_validity["Depression_Established"],
    p(df_validity["Depression_Established"]),
    "r-",
    linewidth=2,
    label=f"r = {criterion_corr:.3f}",
)
axes[0].set_xlabel("Established Depression Scale", fontsize=11, fontweight="bold")
axes[0].set_ylabel("New Depression Scale", fontsize=11, fontweight="bold")
axes[0].set_title("Criterion Validity\n(Should be high)", fontsize=12, fontweight="bold")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Panel 2: Convergent validity
axes[1].scatter(
    df_validity["Anxiety"],
    df_validity["Depression_New"],
    alpha=0.5,
    s=50,
    color="#F4A261",
    edgecolors="black",
    linewidths=0.5,
)
z2 = np.polyfit(df_validity["Anxiety"], df_validity["Depression_New"], 1)
p2 = np.poly1d(z2)
axes[1].plot(
    df_validity["Anxiety"],
    p2(df_validity["Anxiety"]),
    "r-",
    linewidth=2,
    label=f"r = {convergent_corr:.3f}",
)
axes[1].set_xlabel("Anxiety Scale", fontsize=11, fontweight="bold")
axes[1].set_ylabel("New Depression Scale", fontsize=11, fontweight="bold")
axes[1].set_title("Convergent Validity\n(Should be moderate)", fontsize=12, fontweight="bold")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Panel 3: Discriminant validity
axes[2].scatter(
    df_validity["Height_cm"],
    df_validity["Depression_New"],
    alpha=0.5,
    s=50,
    color="#E63946",
    edgecolors="black",
    linewidths=0.5,
)
z3 = np.polyfit(df_validity["Height_cm"], df_validity["Depression_New"], 1)
p3 = np.poly1d(z3)
axes[2].plot(
    df_validity["Height_cm"],
    p3(df_validity["Height_cm"]),
    "r-",
    linewidth=2,
    label=f"r = {discriminant_corr:.3f}",
)
axes[2].set_xlabel("Height (cm)", fontsize=11, fontweight="bold")
axes[2].set_ylabel("New Depression Scale", fontsize=11, fontweight="bold")
axes[2].set_title("Discriminant Validity\n(Should be near zero)", fontsize=12, fontweight="bold")
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("outputs/module_04/validity_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 A valid measure shows:")
print("   ✓ High correlation with established measures (criterion validity)")
print("   ✓ Moderate correlation with related constructs (convergent validity)")
print("   ✓ Low correlation with unrelated constructs (discriminant validity)")

## 5. Common Survey Biases

### 1. Acquiescence Bias (Yea-Saying)
**Problem**: Tendency to agree with statements regardless of content
**Solution**: Include reverse-coded items

**Example**:
- Regular item: "I enjoy social gatherings" (Agree = extraverted)
- Reverse item: "I prefer to avoid social gatherings" (Agree = introverted)
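Before scoring, reverse-worded items must be recoded so that a high number means the same thing on every item. On a scale running from `scale_min` to `scale_max`, the recoded value is `scale_min + scale_max - response`. The sketch below uses hypothetical responses to the two example items; `reverse_code` is an illustrative helper, not a library function.

```python
import numpy as np


def reverse_code(responses, scale_min=1, scale_max=5):
    """Flip responses so that the lowest value becomes the highest and vice versa."""
    return (scale_min + scale_max) - np.asarray(responses)


# Hypothetical responses from five people on a 1-5 agreement scale
enjoy_item = np.array([5, 4, 2, 1, 3])  # "I enjoy social gatherings"
avoid_item = np.array([1, 2, 4, 5, 3])  # "I prefer to avoid social gatherings"

avoid_recoded = reverse_code(avoid_item)
print(avoid_recoded)  # [5 4 2 1 3]

# After recoding, both items point the same way and can be averaged
extraversion_score = (enjoy_item + avoid_recoded) / 2
print(extraversion_score)
```

This is also why reverse items help against acquiescence: a respondent who answers 5 to both statements ends up with 5 and 1 after recoding, which averages to the scale midpoint instead of inflating the score.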

### 2. Social Desirability Bias
**Problem**: Answering in socially acceptable ways rather than truthfully
**Solutions**:
- Guarantee anonymity
- Use indirect questioning
- Normalize sensitive behaviors

**Example**:
- Bad: "Do you ever cheat on your taxes?"
- Better: "Studies show that many people occasionally report less income than they earned. Have you ever done this?"

### 3. Recency/Primacy Effects
**Problem**: Respondents favor first or last options in a list
**Solution**: Randomize order of response options
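One way to randomize is to shuffle the options separately for each respondent, seeding the shuffle with the respondent ID so the order can be reproduced during analysis. This is a minimal sketch: the option list, the `options_for` helper, and the seed are hypothetical, and it only suits unordered lists (ordered scales such as Likert options should keep their order).

```python
import random

OPTIONS = ["Economy", "Healthcare", "Education", "Environment"]  # hypothetical list


def options_for(respondent_id, options=OPTIONS, seed=2024):
    """Return the options in a random order that is reproducible per respondent."""
    rng = random.Random(f"{seed}-{respondent_id}")  # independent generator per person
    shuffled = list(options)  # copy so the master list is never mutated
    rng.shuffle(shuffled)
    return shuffled


for respondent_id in range(3):
    print(respondent_id, options_for(respondent_id))
```

Because the position each respondent saw is recoverable from the ID, you can later test whether options were chosen more often when shown first or last.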

### 4. Central Tendency Bias
**Problem**: Avoiding extreme responses; clustering around middle
**Solution**: Use forced-choice formats or remove neutral option

### 5. Demand Characteristics
**Problem**: Respondents guess study hypothesis and respond accordingly
**Solution**: Disguise true purpose; use filler items

### 6. Response Set
**Problem**: Answering every item the same way without reading it (straightlining), often while rushing through the survey (speeding)
**Solutions**:
- Include attention checks
- Vary item direction
- Flag suspicious patterns in data cleaning

In [None]:
# Detect response biases in survey data


def detect_response_biases(df):
    """
    Detect common response biases in survey data.

    Parameters:
    - df: DataFrame with survey items (rows = respondents, columns = items)

    Returns:
    - Dictionary with bias indicators
    """
    n_respondents = len(df)

    biases = {
        "straightliners": [],  # Identical response to every item
        "speeders": [],  # Low within-person variance (proxy for low effort)
        "acquiescent": [],  # Consistently high scores
        "nay_sayers": [],  # Consistently low scores
    }

    for idx, row in df.iterrows():
        responses = row.values

        # Straightlining: All responses identical
        if len(np.unique(responses)) == 1:
            biases["straightliners"].append(idx)

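        # Low within-person variance: a proxy for low-effort responding
        # (true speeding is detected from completion times, which this data lacks)
        if np.var(responses) < 0.5:
            biases["speeders"].append(idx)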

        # Acquiescence: Mean response > 4 (on 1-5 scale)
        if np.mean(responses) > 4:
            biases["acquiescent"].append(idx)

        # Nay-saying: Mean response < 2
        if np.mean(responses) < 2:
            biases["nay_sayers"].append(idx)

    return biases


# Simulate survey data with some biased respondents
np.random.seed(111)
n_good = 180
n_bad = 20
n_items = 10

# Good respondents: Thoughtful, varied responses
good_responses = np.random.choice(
    [1, 2, 3, 4, 5], size=(n_good, n_items), p=[0.10, 0.20, 0.40, 0.20, 0.10]
)

# Bad respondents
straightliners = np.full((5, n_items), 3)  # All 3s
acquiescent = np.random.choice([4, 5], size=(5, n_items))  # All high
nay_sayers = np.random.choice([1, 2], size=(5, n_items))  # All low
speeders = np.random.choice([2, 3, 4], size=(5, n_items))  # Low variance

# Combine
all_responses = np.vstack([good_responses, straightliners, acquiescent, nay_sayers, speeders])

df_survey = pd.DataFrame(all_responses, columns=[f"Q{i+1}" for i in range(n_items)])

# Detect biases
biases_detected = detect_response_biases(df_survey)

print("=" * 60)
print("RESPONSE BIAS DETECTION")
print("=" * 60)
print(f"\nTotal respondents: {len(df_survey)}")
print(f"\nBiased response patterns detected:")
print(
    f"\n1. Straightliners (all same response): {len(biases_detected['straightliners'])} respondents"
)
print(f"   Example IDs: {biases_detected['straightliners'][:5]}")

print(f"\n2. Speeders (very low variance): {len(biases_detected['speeders'])} respondents")
print(f"   Example IDs: {biases_detected['speeders'][:5]}")

print(f"\n3. Acquiescent (always agree): {len(biases_detected['acquiescent'])} respondents")
print(f"   Example IDs: {biases_detected['acquiescent'][:5]}")

print(f"\n4. Nay-sayers (always disagree): {len(biases_detected['nay_sayers'])} respondents")
print(f"   Example IDs: {biases_detected['nay_sayers'][:5]}")

# Calculate percentage flagged
all_flagged = set(
    biases_detected["straightliners"]
    + biases_detected["speeders"]
    + biases_detected["acquiescent"]
    + biases_detected["nay_sayers"]
)
print(
    f"\nTotal unique respondents flagged: {len(all_flagged)} ({len(all_flagged)/len(df_survey)*100:.1f}%)"
)
print("\n💡 These respondents should be carefully reviewed and possibly excluded.")

## 6. Measurement Scales

Understanding scale types determines appropriate analyses.

### Scale Types (Stevens, 1946)

| Scale | Properties | Examples | Allowed Operations | Appropriate Statistics |
|-------|-----------|----------|-------------------|------------------------|
| **Nominal** | Categories, no order | Gender, ethnicity, country | =, ≠ | Mode, chi-square |
| **Ordinal** | Categories, ordered, unequal intervals | Education level, Likert scales | =, ≠, <, > | Median, percentiles, Spearman correlation |
| **Interval** | Ordered, equal intervals, no true zero | Temperature (°C), IQ scores | =, ≠, <, >, +, − | Mean, SD, Pearson correlation, t-test |
| **Ratio** | Ordered, equal intervals, true zero | Height, weight, income, age | =, ≠, <, >, +, −, ×, ÷ | All statistics, including ratios |

### Examples

**Nominal**: Eye color  
- Blue, Brown, Green, Hazel
- Cannot say "Brown > Blue" (no inherent order)

**Ordinal**: Education  
- Less than HS < HS < Some College < Bachelor's < Graduate
- Order exists, but intervals unequal (HS→Some College ≠ Bachelor's→Graduate)

**Interval**: Temperature  
- 20°C to 30°C = 30°C to 40°C (equal intervals)
- But 40°C is NOT "twice as hot" as 20°C (no true zero)

**Ratio**: Income  
- $0 = absolute absence of money (true zero)
- $100K is twice $50K (ratios meaningful)
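A short calculation shows why ratios fail on an interval scale: the "ratio" of two temperatures changes with the unit, because each unit places zero somewhere different. Income ratios do not have this problem. The variable names and the exchange rate below are illustrative.

```python
celsius_low, celsius_high = 20.0, 40.0


def to_fahrenheit(c):
    return c * 9 / 5 + 32


def to_kelvin(c):
    return c + 273.15


# Same two temperatures, three different "ratios": the number is an artifact of the zero point
naive_ratio = celsius_high / celsius_low
fahrenheit_ratio = to_fahrenheit(celsius_high) / to_fahrenheit(celsius_low)
kelvin_ratio = to_kelvin(celsius_high) / to_kelvin(celsius_low)
print(f"Celsius:    {naive_ratio:.3f}")       # 2.000
print(f"Fahrenheit: {fahrenheit_ratio:.3f}")  # 1.529
print(f"Kelvin:     {kelvin_ratio:.3f}")      # 1.068

# Ratio scale: changing the unit rescales both values and leaves the ratio intact
usd_ratio = 100_000 / 50_000
other_currency_ratio = (100_000 * 0.9) / (50_000 * 0.9)  # 0.9 is a made-up exchange rate
print(usd_ratio, other_currency_ratio)
```

Kelvin has a true zero, so only the Kelvin figure (about 1.07) is a physically meaningful ratio.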

### The Likert Debate

**Question**: Are Likert scales ordinal or interval?

**Technically**: Ordinal  
- Distance between "Agree" and "Strongly Agree" is not necessarily equal to distance between "Disagree" and "Neutral"

**In practice**: Often treated as interval  
- When multiple items are summed into a scale score, the result approximates an interval measure
- Parametric tests (t-tests, ANOVA) are fairly robust and often perform well even with ordinal Likert data

**Recommendation**:
- Single Likert items: Use non-parametric tests (Mann-Whitney U, Kruskal-Wallis)
- Likert scale scores (summed items): Parametric tests usually acceptable
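The recommendation can be sketched as follows, using simulated data for two groups answering a six-item Likert scale. The group sizes, response probabilities, and seed are assumptions made for illustration, so no particular p-value should be expected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
levels = [1, 2, 3, 4, 5]

# Simulated responses: rows = respondents, columns = six Likert items
group_a = rng.choice(levels, size=(80, 6), p=[0.10, 0.20, 0.40, 0.20, 0.10])
group_b = rng.choice(levels, size=(80, 6), p=[0.05, 0.15, 0.30, 0.30, 0.20])

# Single item (ordinal): rank-based Mann-Whitney U test
u_stat, p_single = stats.mannwhitneyu(group_a[:, 0], group_b[:, 0], alternative="two-sided")
print(f"Single item, Mann-Whitney U: U = {u_stat:.1f}, p = {p_single:.4f}")

# Summed scale score (approximately interval): Welch's t-test
score_a, score_b = group_a.sum(axis=1), group_b.sum(axis=1)
t_stat, p_scale = stats.ttest_ind(score_a, score_b, equal_var=False)
print(f"Scale score, Welch t-test:   t = {t_stat:.2f}, p = {p_scale:.4f}")
```

Welch's variant (`equal_var=False`) is used here so the comparison does not rest on equal group variances.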

In [None]:
# Demonstrate appropriate analyses for different scale types

np.random.seed(222)
n = 100

# Generate data for different scale types
data_demo = pd.DataFrame(
    {
        # Nominal: Gender
        "Gender": np.random.choice(["Male", "Female", "Non-binary"], n, p=[0.48, 0.48, 0.04]),
        # Ordinal: Education
        "Education": np.random.choice(
            ["High School", "Some College", "Bachelor", "Graduate"], n, p=[0.25, 0.30, 0.30, 0.15]
        ),
        # Interval: IQ (no true zero)
        "IQ": np.random.normal(100, 15, n).astype(int),
        # Ratio: Income (true zero exists)
        "Income": np.random.lognormal(10.5, 0.5, n).astype(int),
    }
)

print("Sample Data with Different Scale Types:\n")
print(data_demo.head(10))

print("\n" + "=" * 60)
print("APPROPRIATE STATISTICS BY SCALE TYPE")
print("=" * 60)

# Nominal: Mode and frequency
print("\n1. NOMINAL (Gender)")
print("   Appropriate statistics: Mode, frequencies")
print("\n   Frequency distribution:")
print(data_demo["Gender"].value_counts())
print(f"\n   Mode: {data_demo['Gender'].mode()[0]}")

# Ordinal: Median and percentiles
print("\n2. ORDINAL (Education)")
print("   Appropriate statistics: Median, percentiles")
print("\n   Frequency distribution:")
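edu_order = ["High School", "Some College", "Bachelor", "Graduate"]
# reindex keeps the ordinal order and tolerates a category with zero responses
edu_counts = data_demo["Education"].value_counts().reindex(edu_order, fill_value=0)
print(edu_counts)
# Median category: sort the ordinal codes and take the lower middle value
edu_codes = pd.Categorical(data_demo["Education"], categories=edu_order, ordered=True).codes
median_edu = edu_order[np.sort(edu_codes)[(len(edu_codes) - 1) // 2]]
print(f"\n   Median category: {median_edu}")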

# Interval: Mean, SD
print("\n3. INTERVAL (IQ)")
print("   Appropriate statistics: Mean, SD, correlation, t-tests")
print(f"\n   Mean: {data_demo['IQ'].mean():.1f}")
print(f"   SD: {data_demo['IQ'].std():.1f}")
print(f"   Range: {data_demo['IQ'].min()} - {data_demo['IQ'].max()}")

# Ratio: All statistics including ratios
print("\n4. RATIO (Income)")
print("   Appropriate statistics: All statistics + ratios")
print(f"\n   Mean: ${data_demo['Income'].mean():,.0f}")
print(f"   Median: ${data_demo['Income'].median():,.0f}")
print(f"   SD: ${data_demo['Income'].std():,.0f}")
print(f"\n   Ratio example: Person with income ${data_demo['Income'].max():,}")
print(f"   earns {data_demo['Income'].max() / data_demo['Income'].median():.1f}x the median")

print("\n" + "=" * 60)

## 7. Survey Pretesting

**Never deploy a survey without pretesting!**

### Pretesting Methods

#### 1. Cognitive Interviewing
**Process**: Ask participants to "think aloud" while completing the survey

**Goals**:
- Identify confusing questions
- Understand how respondents interpret questions
- Detect missing response options

**Example questions to ask**:
- "What does this question mean to you?"
- "How did you arrive at your answer?"
- "Was anything confusing or unclear?"

**Sample size**: 5-15 participants per round

#### 2. Pilot Testing
**Process**: Administer the survey to a small sample under realistic conditions

**Goals**:
- Test survey flow and timing
- Check skip logic and branching
- Examine response distributions
- Calculate preliminary reliability

**Sample size**: 30-50 participants (from target population)

#### 3. Expert Review
**Process**: Subject matter experts review questions

**Goals**:
- Assess content validity
- Identify biased or leading questions
- Ensure comprehensiveness

**Reviewers**: 3-5 experts in the domain

### Red Flags in Pilot Data

1. **High non-response**: Question may be sensitive or confusing
2. **All choose same option**: Question not discriminating
3. **High "Other" selection**: Missing important response categories
4. **Low reliability (α < 0.70)**: Items don't cohere
5. **Unexpected patterns**: May indicate misunderstanding
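
Some of these red flags can be screened for automatically. The sketch below is one possible approach, not a standard routine: the function name `pilot_red_flags` and the cut-offs `max_missing` and `max_same_option` are assumptions chosen for illustration, while the 0.70 reliability threshold comes from the list above. It covers non-response, lack of variation, and low alpha; "Other" selections and unexpected patterns still need manual review.

```python
import numpy as np
import pandas as pd


def pilot_red_flags(items, max_missing=0.10, max_same_option=0.90, min_alpha=0.70):
    """Screen pilot responses (rows = respondents, columns = items).

    Assumes every item has at least one non-missing response.
    """
    missing_rate = items.isna().mean()
    # Share of non-missing answers that fall on each item's most common option
    top_share = items.apply(lambda col: col.value_counts(normalize=True).iloc[0])

    # Cronbach's alpha on complete cases only
    complete = items.dropna()
    k = complete.shape[1]
    item_var_sum = complete.var(axis=0, ddof=1).sum()
    total_var = complete.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_var_sum / total_var)

    report = pd.DataFrame(
        {
            "missing_rate": missing_rate,
            "top_option_share": top_share,
            "flag_missing": missing_rate > max_missing,
            "flag_no_variation": top_share > max_same_option,
        }
    )
    return report, alpha, alpha < min_alpha


# Simulated pilot data (n = 40) with two deliberately planted problems
rng = np.random.default_rng(1)
trait = rng.normal(3, 1, 40)
pilot = pd.DataFrame(
    {f"q{i}": np.clip(np.round(trait + rng.normal(0, 0.6, 40)), 1, 5) for i in range(1, 5)}
)
pilot["q5"] = 4.0             # everyone picks the same option
pilot.loc[:9, "q2"] = np.nan  # 10 of 40 respondents skipped q2

report, alpha, low_alpha = pilot_red_flags(pilot)
print(report)
print(f"\nCronbach's alpha (complete cases): {alpha:.3f}  low reliability: {low_alpha}")
```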

### Pretesting Checklist

```
□ Conduct cognitive interviews (n ≥ 5)
□ Revise based on feedback
□ Get expert review
□ Pilot test (n ≥ 30)
□ Check item distributions
□ Calculate reliability (Cronbach's α)
□ Analyze completion time
□ Test on multiple devices (if online)
□ Check skip logic and branching
□ Final revisions
□ Document all changes
```
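
For the "Analyze completion time" step, a minimal sketch might look like the following. The times are simulated, and the one-third and three-times-the-median cut-offs are assumed rules of thumb for illustration, not established standards; choose thresholds that suit your own survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Simulated completion times in minutes for 40 pilot respondents (illustrative only)
minutes = pd.Series(rng.lognormal(mean=np.log(12), sigma=0.4, size=40), name="minutes")
minutes.iloc[:3] = [2.0, 2.5, 55.0]  # plant two speeders and one very slow respondent

median_time = minutes.median()
summary = minutes.describe(percentiles=[0.10, 0.50, 0.90])

# Assumed cut-off: under one third of the median suggests rushing
speeders = minutes[minutes < median_time / 3]
# Assumed cut-off: over three times the median suggests interruption or difficulty
slow = minutes[minutes > 3 * median_time]

print(summary.round(1))
print(f"\nPossible speeders (respondent index): {list(speeders.index)}")
print(f"Unusually slow (respondent index):    {list(slow.index)}")
```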

In [None]:
# Create a survey development checklist

survey_checklist = pd.DataFrame(
    {
        "Stage": [
            "Planning",
            "Planning",
            "Planning",
            "Design",
            "Design",
            "Design",
            "Design",
            "Design",
            "Pretesting",
            "Pretesting",
            "Pretesting",
            "Pretesting",
            "Revision",
            "Revision",
            "Deployment",
            "Deployment",
            "Deployment",
        ],
        "Task": [
            "Define research questions and constructs",
            "Review existing validated measures",
            "Determine target population and sample size",
            "Write clear, unbiased questions",
            "Select appropriate response scales",
            "Include reverse-coded items (if applicable)",
            "Add attention checks and validity items",
            "Organize logical flow and grouping",
            "Cognitive interviews (n=5-15)",
            "Expert review (n=3-5)",
            "Pilot test (n=30-50)",
            "Calculate reliability (Cronbach's α)",
            "Revise based on pretest feedback",
            "Retest if major changes made",
            "Finalize survey platform",
            "Test on multiple devices",
            "Deploy and monitor initial responses",
        ],
        "Priority": [
            "Critical",
            "High",
            "Critical",
            "Critical",
            "Critical",
            "High",
            "Medium",
            "High",
            "Critical",
            "High",
            "Critical",
            "Critical",
            "Critical",
            "High",
            "Critical",
            "High",
            "High",
        ],
    }
)

print("SURVEY DEVELOPMENT CHECKLIST")
print("=" * 80)
print(survey_checklist.to_string(index=False))

# Save checklist
survey_checklist.to_csv("outputs/module_04/survey_development_checklist.csv", index=False)
print("\n✓ Checklist saved to outputs/module_04/survey_development_checklist.csv")

## 8. Practice Exercises

### Exercise 1: Identify Question Problems

For each question, identify the flaw(s) and suggest improvement:

1. **"Don't you think that irresponsible people shouldn't be allowed to vote?"**  
   Flaw: ___________  
   Improved: ___________

2. **"How often do you exercise and eat healthy?"**  
   Flaw: ___________  
   Improved: ___________

3. **"On a scale of 1-5, how satisfied are you with your extremely comprehensive and helpful health insurance plan?"**  
   Flaw: ___________  
   Improved: ___________

In [None]:
# Exercise 2: Calculate Cronbach's Alpha for your own data
# Create a 5-item scale measuring "Academic Motivation"

# Simulate responses from 100 students
np.random.seed(555)
n_students = 100

# Each student has true motivation level
true_motivation = np.random.normal(3, 1, n_students)

# Create 5 items (with noise)
items = {}
for i in range(1, 6):
    items[f"Item_{i}"] = true_motivation + np.random.normal(0, 0.5, n_students)
    # Convert to 1-5 Likert
    items[f"Item_{i}"] = np.clip(np.round(items[f"Item_{i}"]), 1, 5)

df_motivation = pd.DataFrame(items)

# YOUR TASK:
# 1. Calculate Cronbach's alpha
# 2. Calculate item-total correlations
# 3. Determine if any items should be removed

# YOUR CODE HERE
# alpha = cronbach_alpha(df_motivation)
# print(f"Cronbach's Alpha: {alpha:.3f}")

In [None]:
# Exercise 3: Detect biased respondents
# Use the detect_response_biases function on new data

# Generate survey data
np.random.seed(777)
test_data = np.random.choice([1, 2, 3, 4, 5], size=(150, 8))

# Add some biased respondents (you decide how)
# Hint: Create straightliners, acquiescent respondents, etc.

# YOUR CODE HERE

## 9. Summary and Key Takeaways

### The Golden Rules of Survey Design

1. **Keep it simple**: One idea per question
2. **Avoid bias**: Neutral wording, balanced options
3. **Be specific**: Define terms, specify timeframes
4. **Match scales to constructs**: Choose appropriate response formats
5. **Test psychometric properties**: Reliability and validity are non-negotiable
6. **Pretest extensively**: Cognitive interviews → Pilot → Revise → Repeat
7. **Monitor data quality**: Detect and handle biased responses
8. **Document everything**: Question development, changes, decisions

### Reliability vs. Validity Decision Tree

```
Is your measure RELIABLE (consistent)?
    │
    NO → Fix it! (Review items, increase length, train raters)
    │
    YES → Is it VALID (accurate)?
           │
           NO → It's consistently measuring the WRONG thing
           │     └─> Assess construct validity, revise items
           │
           YES → Good to go! (But keep monitoring)
```

### Common Mistakes to Avoid

❌ Skipping pretesting  
❌ Using double-barreled questions  
❌ Leading or loaded language  
❌ Assuming all Likert scales are reliable  
❌ Ignoring scale type when analyzing  
❌ Not checking for response biases  
❌ Deploying without pilot testing  

### Best Practices

✓ Use validated measures when available  
✓ Include reverse-coded items  
✓ Add attention checks  
✓ Calculate Cronbach's α (target: ≥ 0.70)  
✓ Assess validity (convergent, discriminant, criterion)  
✓ Screen for biased response patterns  
✓ Report psychometric properties in publications  

### Moving Forward

Now you can create measurement instruments that produce trustworthy data. The next module covers **sampling strategies**, teaching you how to select representative samples for your surveys and studies.

## 10. Additional Resources

### Essential Readings

1. **Dillman, Smyth, & Christian (2014)**. *Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method*
   - Comprehensive guide to survey methodology

2. **Fowler (2014)**. *Survey Research Methods* (5th ed.)
   - Classic textbook on survey design

3. **DeVellis & Thorpe (2021)**. *Scale Development: Theory and Applications* (5th ed.)
   - The definitive guide to creating measurement scales

4. **Tourangeau, Rips, & Rasinski (2000)**. *The Psychology of Survey Response*
   - Understanding cognitive processes in survey taking

### Online Resources

- **American Association for Public Opinion Research (AAPOR)**: Best practices and ethics
- **Questionnaire Design Tips** (Pew Research): Practical guidelines
- **Cognitive Interviewing Guide** (Centers for Disease Control): Free manual

### Tools and Software

- **Survey Platforms**: Qualtrics, SurveyMonkey, Google Forms, LimeSurvey (open-source)
- **Reliability Calculators**: SPSS, R (psych package), Python (pingouin)
- **Validated Scales**: APA PsycTests, National Cancer Institute Grid-Enabled Measures Database

---

## Congratulations!

You've completed **Module 04: Survey Design & Measurement**. You can now:

✓ Design effective, unbiased survey questions  
✓ Select appropriate response scales  
✓ Evaluate reliability using Cronbach's alpha  
✓ Assess validity (criterion, convergent, discriminant)  
✓ Identify and mitigate survey biases  
✓ Understand measurement scale types  
✓ Conduct comprehensive survey pretesting  
✓ Detect problematic response patterns  

**Next Module**: Sampling Strategies  
**File**: `05_sampling_strategies.ipynb`

---