# BATS 2023: Understanding Race/Ethnicity Question Design Issues

## Purpose

This notebook analyzes free-text responses in the `race_other` field to understand **how survey participants feel about the race/ethnicity questions** and identify opportunities to improve question design.

**Key Questions:**
- What identities are people writing in that don't fit the provided checkboxes?
- Are people confused about the difference between race and ethnicity?
- What patterns suggest the current categories are inadequate?
- What feedback can inform better question design for future surveys?

## Background

The BATS 2023 survey asks about race using a "select all that apply" format with these checkboxes:
- `race_1`: White
- `race_2`: Black or African American  
- `race_3`: Asian
- `race_4`: American Indian or Alaska Native
- `race_5`: Native Hawaiian or Pacific Islander
- `race_997`: Other race (with free-text field `race_other`)
- `race_999`: Prefer not to answer

This follows standard US Census categories, but the free-text responses suggest some participants don't identify with these options.

**This analysis is NOT about mapping responses to existing categories** - it's about understanding what's missing or confusing about the current question structure.

## 1. Setup and Data Loading

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import nltk
import pandas as pd
import numpy as np
nltk.download('punkt')
nltk.download('punkt_tab')
from collections import Counter
from nltk import bigrams
from pathlib import Path
from typing import Dict, List

# Data paths
DATA_DIR = Path(r"C:\Box\Modeling and Surveys\Surveys\Travel Diary Survey\BATS_2023\Versioned_Data\PreWeight_PreLink_MonToSun_20250610")
DATASET_GUIDE = "bats_dataset_guide.html"

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\schildress\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\schildress\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
person_file = DATA_DIR / "person.csv"

person_df = pd.read_csv(person_file)
person_df.shape

(17188, 207)

## 2. Initial Exploration

In [4]:
# Check race-related columns
race_cols = [col for col in person_df.columns if 'race' in col.lower()]
print("Race-related columns:")
for col in race_cols:
    print(f"  {col}")

Race-related columns:
  race_1
  race_2
  race_3
  race_4
  race_5
  race_997
  race_999
  race_other


In [5]:
# Examine race_other field
print(f"Total persons: {len(person_df)}")
print(f"race_other not null: {person_df['race_other'].notna().sum()}")
print(f"race_other is null: {person_df['race_other'].isna().sum()}")
print(f"\nPercentage with race_other: {person_df['race_other'].notna().sum() / len(person_df) * 100:.2f}%")

Total persons: 17188
race_other not null: 469
race_other is null: 16719

Percentage with race_other: 2.73%


In [6]:
# Check if race_997 (Other race) was selected
# Note: Need to investigate the coding scheme first!
if 'race_997' in person_df.columns:
    # Count non-null, non-zero values (assuming checked = some non-zero value)
    race_997_checked = person_df['race_997'].notna() & (person_df['race_997'] != 0)
    print(f"race_997 appears to be checked (non-zero): {race_997_checked.sum()}")
    
    # Check alignment with race_other field
    has_race_other = person_df['race_other'].notna()
    print(f"\nrace_997 checked AND race_other filled: {(race_997_checked & has_race_other).sum()}")
    print(f"race_997 checked BUT race_other empty: {(race_997_checked & ~has_race_other).sum()}")
    print(f"race_997 not checked BUT race_other filled: {(~race_997_checked & has_race_other).sum()}")

race_997 appears to be checked (non-zero): 5151

race_997 checked AND race_other filled: 469
race_997 checked BUT race_other empty: 4682
race_997 not checked BUT race_other filled: 0


In [7]:
# Investigate race_997 coding - what values does it actually contain?
if 'race_997' in person_df.columns:
    print("Unique values in race_997:")
    print(person_df['race_997'].value_counts(dropna=False).sort_index())
    print(f"\nData type: {person_df['race_997'].dtype}")
    print(f"Min: {person_df['race_997'].min()}")
    print(f"Max: {person_df['race_997'].max()}")
    print(f"\nMissing values: {person_df['race_997'].isna().sum()}")

Unique values in race_997:
race_997
0      12037
1        469
995     4682
Name: count, dtype: int64

Data type: int64
Min: 0
Max: 995

Missing values: 0
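
The value counts above suggest that 995 is a missing or not-asked code rather than a checked box: its count (4,682) equals the "checked BUT race_other empty" figure from the previous cell. Here is a minimal sketch of the alignment check that counts only an explicit `1` as checked. It assumes 995 means "missing", which should be confirmed against the dataset guide. The helper name `other_race_alignment` and the toy frame are hypothetical.

```python
import pandas as pd

MISSING_CODE = 995  # assumption: 995 marks a missing / not-asked response


def other_race_alignment(df: pd.DataFrame) -> dict:
    """Compare the race_997 checkbox with the race_other free-text field."""
    checked = df["race_997"] == 1            # only an explicit 1 counts as checked
    answered = df["race_997"] != MISSING_CODE
    has_text = df["race_other"].notna()
    return {
        "answered": int(answered.sum()),
        "checked": int(checked.sum()),
        "checked_with_text": int((checked & has_text).sum()),
        "checked_without_text": int((checked & ~has_text).sum()),
        "text_without_check": int((~checked & has_text).sum()),
    }


# Toy data for illustration only, not survey records
toy = pd.DataFrame({
    "race_997": [0, 1, 995, 1, 995, 0],
    "race_other": [None, "Latino", None, "Middle Eastern", None, None],
})
result = other_race_alignment(toy)
print(result)
```

On the real data this would be called as `other_race_alignment(person_df)`. If the assumption holds, the 4,682 rows coded 995 are non-responses rather than "Other race" selections with an empty text field.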


In [8]:
# Check all race columns to understand the coding scheme
race_checkbox_cols = ['race_1', 'race_2', 'race_3', 'race_4', 'race_5', 'race_997', 'race_999']
available_cols = [col for col in race_checkbox_cols if col in person_df.columns]

print("Sample values from race columns (first 10 rows):")
print(person_df[available_cols].head(10))

print("\n" + "="*50)
print("Value distributions for each race column:")
print("="*50)
for col in available_cols:
    print(f"\n{col}:")
    print(person_df[col].value_counts(dropna=False).head())

Sample values from race columns (first 10 rows):
   race_1  race_2  race_3  race_4  race_5  race_997  race_999
0       0       0       0       0       1         0         0
1       0       0       0       0       0         1         0
2     995     995     995     995     995       995       995
3       0       0       1       0       0         0         0
4     995     995     995     995     995       995       995
5     995     995     995     995     995       995       995
6       0       0       0       0       0         0         1
7       0       0       0       0       0         0         1
8       0       0       0       0       1         0         0
9     995     995     995     995     995       995       995

Value distributions for each race column:

race_1:
race_1
0      11950
995     4682
1        556
Name: count, dtype: int64

race_2:
race_2
0      12339
995     4682
1        167
Name: count, dtype: int64

race_3:
race_3
0      8368
995    4682
1      4138
Name: count,

In [9]:
# What ethnicity/Hispanic columns exist in the dataset?
hisp_cols = [col for col in person_df.columns if 'hisp' in col.lower() or 'ethn' in col.lower()]
print(f"Found {len(hisp_cols)} Hispanic/ethnicity columns:")
for col in sorted(hisp_cols):
    print(f"  {col}")

Found 7 Hispanic/ethnicity columns:
  ethnicity_1
  ethnicity_2
  ethnicity_3
  ethnicity_4
  ethnicity_997
  ethnicity_999
  ethnicity_other


## Key Finding: Ethnicity Questions Exist Separately

The survey DOES ask about ethnicity/Hispanic origin separately from race. However, many respondents still wrote "Hispanic" or "Latino" in the race_other field. This suggests:

1. **People might not be reaching the ethnicity question** (question ordering issue)
2. **The distinction between race and ethnicity is unclear** (question design issue)  
3. **Hispanic/Latino identity feels more salient than race categories** for many respondents

Let's investigate what ethnicity categories are offered and how often they're used.

In [10]:
# Show distribution of ethnicity checkboxes (same 0/1/995 coding as race)
print("Ethnicity checkbox distributions:")
print("="*50)
for col in ['ethnicity_1', 'ethnicity_2', 'ethnicity_3', 'ethnicity_4', 'ethnicity_997', 'ethnicity_999']:
    if col in person_df.columns:
        checked = (person_df[col] == 1).sum()
        total = (person_df[col].notna() & (person_df[col] != 995)).sum()
        print(f"\n{col}: {checked} selected out of {total} ({checked/total*100:.1f}%)")
        
print("\n" + "="*50)
print(f"\nethnicity_other responses: {person_df['ethnicity_other'].notna().sum()}")

Ethnicity checkbox distributions:

ethnicity_1: 10258 selected out of 12506 (82.0%)

ethnicity_2: 705 selected out of 12506 (5.6%)

ethnicity_3: 72 selected out of 12506 (0.6%)

ethnicity_4: 37 selected out of 12506 (0.3%)

ethnicity_997: 441 selected out of 12506 (3.5%)

ethnicity_999: 1060 selected out of 12506 (8.5%)


ethnicity_other responses: 440


In [11]:
# What did people write in ethnicity_other?
print("Top 30 ethnicity_other responses:")
print(person_df['ethnicity_other'].value_counts().head(30))

Top 30 ethnicity_other responses:
ethnicity_other
Spanish              13
Colombian            10
Spanish               9
Spain                 7
peruvian              6
Nicaraguan            6
Peruvian              6
Colombian             6
El Salvador           5
Guatemalan            5
Spain                 5
Brazilian             4
Peruvian              4
Portuguese            4
Nicaragua             4
Argentina             3
Argentinian           3
Brazilian             3
South American        3
Central America       3
Nicaraguan            3
Peruvian\n            3
Central American      3
El Salvador           3
Salvadoran            3
spanish               2
Chilean               2
Costa Rican           2
Panamanian            2
Salvadoran            2
Name: count, dtype: int64
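
Several entries in the list above appear more than once (for example "Spanish" and "Peruvian"), presumably because they differ only in case or in leading or trailing whitespace such as the `\n` visible in one row. Here is a minimal sketch of normalizing the text before counting, so that such variants group together. The helper name `normalize_text` and the toy series are hypothetical.

```python
import pandas as pd


def normalize_text(series: pd.Series) -> pd.Series:
    """Collapse whitespace and case so visually identical answers group together."""
    return (
        series.dropna()
        .astype(str)
        .str.replace(r"\s+", " ", regex=True)  # newlines, tabs, double spaces -> one space
        .str.strip()
        .str.casefold()
    )


# Toy responses for illustration only
toy = pd.Series(["Spanish", "Spanish ", "spanish", "Peruvian\n", "peruvian", None])
counts = normalize_text(toy).value_counts()
print(counts)
```

On the real data the same call would be `normalize_text(person_df['ethnicity_other']).value_counts()`.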


## Survey Design Analysis: What's Not Working?

Let's analyze **survey design issues** revealed by the free-text responses in race_other.

In [17]:
# Categorize race_other responses by theme to identify missing categories
def categorize_race_other(text):
    """Categorize free-text responses by identity theme"""
    if pd.isna(text):
        return None
    
    text_lower = str(text).lower().strip()
    
    # Hispanic/Latino
    hispanic_keywords = ['hispanic', 'latino', 'latina', 'latinx', 'mexican', 'puerto rican',
                          'salvadoran', 'guatemalan', 'colombian', 'peruvian', 'nicaraguan',
                          'latin', 'mestizo', 'chicano', 'central american', 'south american']
    if any(kw in text_lower for kw in hispanic_keywords):
        return 'Hispanic/Latino'
    
    # Middle Eastern / North African (MENA)
    mena_keywords = ['middle eastern', 'middle east', 'arab', 'persian', 'iranian', 'lebanese',
                      'egyptian', 'moroccan', 'turkish', 'armenian', 'syrian', 'iraqi', 'jewish']
    if any(kw in text_lower for kw in mena_keywords):
        return 'Middle Eastern / North African'
    
    # Multiracial / Mixed
    mixed_keywords = ['mixed', 'multi', 'biracial', 'hapa', 'half']
    if any(kw in text_lower for kw in mixed_keywords):
        return 'Multiracial / Mixed'
    
    # South Asian (often written as "Indian" which Census classifies as Asian but people may not identify that way)
    south_asian_keywords = ['indian', 'india', 'pakistani', 'bangladesh', 'sri lanka', 'nepal',
                             'south asian', 'desi']
    if any(kw in text_lower for kw in south_asian_keywords):
        return 'South Asian'
    
    # Protest / Philosophical responses
    protest_keywords = ['human', 'american', 'person', 'prefer not', 'none', 'n/a', 'na', 'no']
    if any(kw in text_lower for kw in protest_keywords):
        return 'Protest / Decline to state'
    
    # European nationality (confusion between nationality and race)
    european_keywords = ['european', 'italian', 'irish', 'german', 'french', 'english', 
                          'scottish', 'polish', 'russian', 'scandinavian', 'caucasian']
    if any(kw in text_lower for kw in european_keywords):
        return 'European ancestry / White'
    
    return 'Other / Unclear'
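
One caveat with the `kw in text_lower` checks above: they match substrings, so a short keyword such as `'no'` also matches "north african" and `'na'` matches "native american". Here is a minimal sketch of whole-word matching with regular expressions, offered as one possible refinement rather than the method used in this notebook. The helper name `matches_keyword` is hypothetical, and the keyword list is a subset of the original. It leaves out `'american'` and `'person'`, because whole-word matching alone would not stop `'american'` from matching "Native American".

```python
import re
from typing import Iterable


def matches_keyword(text: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword appears as a whole word or phrase in text."""
    alternatives = "|".join(re.escape(kw) for kw in keywords)
    # Lookarounds require that no word character touches either end of the match
    pattern = r"(?<!\w)(?:" + alternatives + r")(?!\w)"
    return re.search(pattern, text.lower()) is not None


protest_keywords = ["human", "prefer not", "none", "n/a", "na", "no"]

for sample in ["North African", "Native American", "ashkenazi", "Human race", "N/A"]:
    print(f"{sample!r}: {matches_keyword(sample, protest_keywords)}")
```

The order of the checks in `categorize_race_other` still matters with this approach, since the first matching theme wins.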



### 2. Middle Eastern / North African (MENA) - Not Represented

**Finding:** 53 people (11% of race_other) wrote Middle Eastern or North African identities.

**The Problem:** US Census classifies MENA as "White," but many MENA individuals don't identify that way. The current categories force them to:
- Check "White" (doesn't feel accurate)
- Check "Other race" and write in (what they did)
- Skip the question

**Examples:** Middle Eastern, Arab, Persian, Iranian, Jewish (ethnic), Lebanese, Egyptian, Turkish, Armenian

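**Recommendation:** Consider adding "Middle Eastern or North African" as a distinct race category. The 2024 revision of the federal race/ethnicity standards (OMB Statistical Policy Directive No. 15) added MENA as a minimum reporting category, although the 2020 Census still counted MENA respondents as White. For Bay Area surveys, MENA is a significant population that deserves representation.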

---

### 3. Multiracial / Mixed Identity - Limited Options

**Finding:** 35 people (7%) wrote "mixed," "multiracial," "biracial," or similar terms.

**The Problem:** While the survey allows "select all that apply," some people want to explicitly identify as mixed/multiracial rather than checking multiple boxes. The free-text lets them express this.

**Examples:** mixed, mixed race, biracial, hapa, half and half

**Recommendation:** Consider whether "Multiracial" should be its own checkbox option, or if "select all that apply" sufficiently captures these identities. Ask mixed-race people how they prefer to answer.

## Summary: Recommendations for Survey Design

Based on 469 free-text responses in race_other, here are the key issues and recommendations:

### Top 3 Issues

**1. Hispanic/Latino Identity (46% of responses)**
- Problem: Respondents don't identify with standard race categories (White, Black, Asian)  
- Current approach: Separate race + ethnicity questions following US Census
- Consider: Offering "Hispanic/Latino" as a race option, or using combined race/ethnicity question
- Test: Ask Hispanic respondents how they prefer to identify

**2. Middle Eastern / North African - Missing Category (11%)**
- Problem: MENA individuals are classified as "White" but many don't identify that way
- Consider: Adding "Middle Eastern or North African" as distinct race category
- Note: This is increasingly recognized as a needed category (the 2024 revision of OMB Statistical Policy Directive No. 15 added it as a minimum reporting category)

**3. Asian Subcategories - Too Broad (6% specifically South Asian)**
- Problem: "Asian" encompasses East Asian, Southeast Asian, South Asian - very different identities
- Consider: Disaggregating into subcategories, especially given Bay Area's large Asian population
- Note: Providing specificity shows respect for community diversity

### Secondary Issues

**Multiracial Identity (7%):** Some want explicit "Multiracial/Mixed" option beyond "select all"

**Conceptual Confusion:** Many conflate race, ethnicity, nationality, ancestry - clearer guidance needed

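**Protest Responses:** 11% of race_other responses were categorized as protest or decline to state (vs. 10% who checked "Prefer not to answer"), which suggests discomfort with the question. This share is likely overstated: short keywords such as `na` and `no` match as substrings, so responses like "Native American" and "North African" also land in this category.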

### Process Recommendations

1. **User Testing:** Conduct cognitive interviews with diverse respondents about how they interpret the race/ethnicity questions
2. **Question Ordering:** Test whether asking ethnicity before or after race affects response patterns  
3. **International Context:** Consider whether US Census categories are appropriate for diverse Bay Area population

In [None]:
# Create summary table for reporting
# Denominator: persons whose race_997 is not coded 995 (12,506 in this dataset)
n_answered = int((person_df['race_997'] != 995).sum())

summary_df = race_other_df['category'].value_counts().reset_index()
summary_df.columns = ['Identity Category', 'Count']
summary_df['Percentage of race_other'] = (summary_df['Count'] / len(race_other_df) * 100).round(1)
summary_df['Percentage of all respondents'] = (summary_df['Count'] / n_answered * 100).round(2)

print("\nSUMMARY: Identity Categories Missing from Current Race Question")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)
print(f"\nTotal race_other responses: {len(race_other_df)}")
print(f"Total survey respondents: {n_answered:,}")
print(f"Percent using race_other field: {len(race_other_df)/n_answered*100:.1f}%")


SUMMARY: Identity Categories Missing from Current Race Question
             Identity Category  Count  Percentage of race_other  Percentage of all respondents
               Hispanic/Latino    218                      46.5                           1.74
               Other / Unclear     64                      13.6                           0.51
Middle Eastern / North African     53                      11.3                           0.42
    Protest / Decline to state     50                      10.7                           0.40
           Multiracial / Mixed     35                       7.5                           0.28
                   South Asian     28                       6.0                           0.22
     European ancestry / White     21                       4.5                           0.17

Total race_other responses: 469
Total survey respondents: 12,506
Percent using race_other field: 3.8%


### 4. South Asian Identity - "Asian" Feels Too Broad

**Finding:** 28 people (6%) specifically wrote "Indian," "South Asian," or "Desi."

**The Problem:** While Census classifies all Asians together, people from South Asia (Indian subcontinent) may not identify with or feel represented by the broad "Asian" category, which often connotes East Asian in American English.

**Examples:** Indian, Asian Indian, South Asian, Desi

**Recommendation:** Consider disaggregating "Asian" into subcategories:
- East Asian (Chinese, Japanese, Korean, etc.)
- Southeast Asian (Vietnamese, Filipino, Thai, etc.)  
- South Asian (Indian, Pakistani, Bangladeshi, etc.)

Or at minimum, communicate clearly that "Asian" includes all of Asia.

---

### 5. Confusion: Nationality vs. Race vs. Ethnicity

**Finding:** Many responses show confusion between nationality (Mexican, Colombian), ethnicity (Hispanic, Latino), and race constructs.

**Examples:**
- Writing nationalities: "Mexican," "Colombian," "Guatemalan," "Peruvian"
- Writing ancestries: "European," "Italian," "German," "Russian"
- Writing religious or ethnoreligious identities: "Jewish," "Ashkenazi"

**The Problem:** The race/ethnicity distinction is a social construct that doesn't map cleanly onto how people understand their own identity. Many people think in terms of national origin, not abstract racial categories.

**Recommendation:** 
- Provide clearer guidance about what "race" means in the survey context
- Consider whether national origin / ancestry is actually more useful for your analysis
- Test alternative question framings: "How do you describe your racial or ethnic identity?"

In [None]:
# Show examples from each category
print("EXAMPLES BY CATEGORY")
print("="*80)

for category in race_other_df['category'].value_counts().index:
    examples = race_other_df[race_other_df['category'] == category]['race_other'].head(8).tolist()
    print(f"\n{category} ({race_other_df[race_other_df['category'] == category].shape[0]} responses):")
    for ex in examples:
        print(f"  • {ex}")

EXAMPLES BY CATEGORY

Hispanic/Latino (218 responses):
  • Mexican

  • Latina
  • Mexican American and Swedish American
  • latino
  • Mexicano
  • Mexican-American 
  • Hispanic
  • Hispanic

Other / Unclear (64 responses):
  • brown
  • Guatemala
  • i'm a bit of a hybrid
  • 亚裔
  • other
  • othet
  • h
  • asian

Middle Eastern / North African (53 responses):
  • Middle Eastern
  • Jewish/Middle Eastern
  • Southwest Asian / Arab
  • middle eastern 
  • Middle eastern 
  • middle eastern
  • Middle Eastern
  • middle eastern

Protest / Decline to state (50 responses):
  • European American
  • North African 
  • West European, Native American.
  • Native American
  • Human race
  • ashkenazi
  • white / native american
  • filipino, salvadorean 

Multiracial / Mixed (35 responses):
  • mixed races
  • mixed race
  • mixto , multiracial 
  • mixed race
  • mixed
  • I am half salvadorian, a quarter Japanese, and a quarter English. "English" stands for Scottish,Irish, and German.
  

In [None]:
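# Create subset with race_other responses
race_other_df = person_df[person_df['race_other'].notna()].copy()

# Assign a theme to each response; the summary and example cells rely on this 'category' column
race_other_df['category'] = race_other_df['race_other'].apply(categorize_race_other)

print(f"Persons with race_other: {len(race_other_df)}")
print(f"Unique responses: {race_other_df['race_other'].nunique()}")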

Persons with race_other: 469
Unique responses: 300


In [None]:
# Most common raw responses
print("Top 30 most common race_other responses:")
race_other_df['race_other'].value_counts().head(30)

Top 30 most common race_other responses:


race_other
Hispanic             18
Hispanic             15
latino               13
Latino               12
Middle Eastern       10
hispanic             10
Mexican               9
mexican               8
Latino                7
mixed                 7
Latina                6
middle eastern        6
Mexican\n             5
Mixed                 5
indian                4
Puerto Rican          4
mixed race            4
Jewish                4
latin                 3
middle eastern        3
Mexican               3
Latinx                3
mixed                 3
Hispanic\n            3
Latin                 3
Mestizo               3
Mexicano              2
European American     2
Native American       2
Caucasian             2
Name: count, dtype: int64

## 3. Text Cleaning and Analysis

In [None]:
# Clean text for analysis
race_other_df['text_clean'] = (
    race_other_df['race_other']
    .str.lower()
    .str.strip()
    .str.replace(r'[^\w\s]', '', regex=True)
)

race_other_df['text_clean'].value_counts().head(20)

text_clean
hispanic                         48
latino                           38
mexican                          27
middle eastern                   23
mixed                            17
indian                            9
latina                            8
latin                             8
jewish                            6
mixed race                        5
puerto rican                      4
asian indian                      4
mexican american                  4
mestizo                           4
mexicano                          3
brown                             3
latinx                            3
asian indian pacific islander     3
chinese                           3
hispano                           3
Name: count, dtype: int64

In [None]:
# Tokenize
race_other_df['tokens'] = race_other_df['text_clean'].apply(nltk.word_tokenize)

# Single word analysis
all_words = []
for token_list in race_other_df['tokens']:
    all_words.extend(token_list)

word_counts = Counter(all_words)
print("Top 30 most common words:")
pd.DataFrame(word_counts.most_common(30), columns=['word', 'count'])

Top 30 most common words:


       word  count
0  hispanic     62
1   mexican     56
2    latino     49
3     mixed     38
4       and     35
5  american     32
6   eastern     28
7    middle     27
8    indian     24
9     asian     21


In [None]:
# Bigram analysis
all_bigrams = []
for token_list in race_other_df['tokens']:
    all_bigrams.extend(list(bigrams(token_list)))

bigram_counts = Counter(all_bigrams)
print("Top 20 two-word phrases:")
pd.DataFrame(bigram_counts.most_common(20), columns=['bigram', 'count'])

Top 20 two-word phrases:


                bigram  count
0    (middle, eastern)     25
1              (i, am)     12
2        (mixed, race)     11
3      (puerto, rican)      8
4  (mexican, american)      7
5      (asian, indian)      7
6   (native, american)      6
7       (mexican, and)      4
8  (pacific, islander)      4
9            (my, dad)      3


## 4. Language Detection

In [None]:
# Check for non-ASCII characters
def has_non_ascii(text):
    if pd.isna(text):
        return False
    try:
        text.encode('ascii')
        return False
    except UnicodeEncodeError:
        return True

race_other_df['has_non_ascii'] = race_other_df['race_other'].apply(has_non_ascii)

print(f"Responses with non-ASCII characters: {race_other_df['has_non_ascii'].sum()}")
print(f"Percentage: {race_other_df['has_non_ascii'].sum() / len(race_other_df) * 100:.1f}%")

Responses with non-ASCII characters: 3
Percentage: 0.6%


In [None]:
# Show non-ASCII examples
non_ascii_responses = race_other_df[race_other_df['has_non_ascii']]
if len(non_ascii_responses) > 0:
    print(f"\nExamples of non-ASCII responses ({len(non_ascii_responses)} total):")
    # Print explicitly: a bare expression inside an if-block is not auto-displayed
    print(non_ascii_responses['race_other'].value_counts().head(20))


Examples of non-ASCII responses (3 total):


race_other
亚裔                                    1
Indígena Mexica                       1
mexicano (nativo y español blanco)    1
Name: count, dtype: int64
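
The non-ASCII check flags accented Latin text (Spanish) and Chinese characters alike, so by itself it does not say which writing system a response uses. Here is a minimal sketch that separates the two by reading the first word of each letter's Unicode character name. This identifies scripts, not languages, and the helper name `scripts_used` is hypothetical.

```python
import unicodedata
from collections import Counter


def scripts_used(text: str) -> Counter:
    """Count letters by the first word of their Unicode name (e.g. LATIN, CJK)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


# Two of the responses shown above, plus one plain-ASCII response
for sample in ["亚裔", "Indígena Mexica", "middle eastern"]:
    print(sample, dict(scripts_used(sample)))
```

Responses that need translation, such as those in a non-Latin script, could then be routed separately from Spanish-language responses that only contain accented letters.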

## 5. Multi-racial Response Detection

In [None]:
# Check if respondents also selected other race checkboxes
race_checkbox_cols = ['race_1', 'race_2', 'race_3', 'race_4', 'race_5', 'race_997', 'race_999']
available_race_cols = [col for col in race_checkbox_cols if col in race_other_df.columns]

if available_race_cols:
    # Count how many race boxes were checked (1 = checked; the total includes race_997 itself)
    race_other_df['num_races_selected'] = (race_other_df[available_race_cols] == 1).sum(axis=1)
    
    print("Distribution of number of race categories selected:")
    print(race_other_df['num_races_selected'].value_counts().sort_index())
    
    print(f"\nRespondents who selected multiple race categories: {(race_other_df['num_races_selected'] > 1).sum()}")
    print(f"Percentage: {(race_other_df['num_races_selected'] > 1).sum() / len(race_other_df) * 100:.1f}%")

Distribution of number of race categories selected:
num_races_selected
1    372
2     79
3     12
4      4
6      2
Name: count, dtype: int64

Respondents who selected multiple race categories: 97
Percentage: 20.7%
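
Because every row in `race_other_df` has `race_997` checked, a count above 1 means the respondent selected at least one box in addition to "Other race". Here is a minimal sketch of breaking down which additional boxes those were. The helper name `co_selected_counts` and the toy frame are hypothetical; column names follow the dataset.

```python
import pandas as pd
from typing import List

# All race checkboxes except race_997, which is always checked in this subset
checkbox_cols = ["race_1", "race_2", "race_3", "race_4", "race_5", "race_999"]


def co_selected_counts(df: pd.DataFrame, cols: List[str]) -> pd.Series:
    """Count, per checkbox, how many rows have it set to 1 (0 and 995 are ignored)."""
    return (df[cols] == 1).sum().sort_values(ascending=False)


# Toy data for illustration only, not survey records
toy = pd.DataFrame({
    "race_1": [1, 0, 0, 1],
    "race_2": [0, 0, 0, 0],
    "race_3": [0, 1, 0, 0],
    "race_4": [0, 0, 0, 0],
    "race_5": [0, 0, 0, 1],
    "race_999": [0, 0, 0, 0],
    "race_997": [1, 1, 1, 1],
})
counts = co_selected_counts(toy, checkbox_cols)
print(counts)
```

On the real data this would be `co_selected_counts(race_other_df, checkbox_cols)`.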


In [None]:
# Detect multi-racial keywords in free text
multi_racial_keywords = ['and', 'half', 'mixed', 'multiracial', 'multi racial', 'biracial', 'bi racial', '/', '-']

def contains_multiracial_language(text):
    if pd.isna(text):
        return False
    text_lower = str(text).lower()
    return any(keyword in text_lower for keyword in multi_racial_keywords)

race_other_df['has_multiracial_text'] = race_other_df['race_other'].apply(contains_multiracial_language)

print(f"Responses with multi-racial language: {race_other_df['has_multiracial_text'].sum()}")
print(f"Percentage: {race_other_df['has_multiracial_text'].sum() / len(race_other_df) * 100:.1f}%")

# Examples
if race_other_df['has_multiracial_text'].sum() > 0:
    print("\nExamples of multi-racial text:")
    print(race_other_df[race_other_df['has_multiracial_text']]['race_other'].value_counts().head(20))

Responses with multi-racial language: 105
Percentage: 22.4%

Examples of multi-racial text:
race_other
mixed         7
Mixed         5
mixed race    4
mixed

## Conclusion: What We Learned About Survey Design

This analysis examined 469 free-text responses in the race_other field to understand how well the current race/ethnicity questions serve respondents. **The goal was NOT to map responses to existing categories, but to identify what's missing or confusing about the current question design.**

### Key Findings

1. **Hispanic/Latino identity is the #1 issue (46% of responses)**
   - Many Hispanic respondents don't identify with the provided race categories
   - Respondents may write "Hispanic" in race_other even when they also answer the ethnicity question (the two fields have not yet been cross-tabulated here)
   - The two-question format (race + ethnicity) doesn't match how people experience their identity

2. **Middle Eastern / North African is not represented (11% of responses)**
   - MENA individuals are classified as "White" by US Census but many don't identify that way
   - This is a significant Bay Area population that deserves distinct representation

3. **"Asian" is too broad (6% wrote specific South Asian identity)**
   - Combining East Asian, Southeast Asian, and South Asian into one category doesn't reflect community diversity
   - Consider disaggregating, especially for a region with large Asian population
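
The first finding, on how Hispanic/Latino write-ins relate to the separate ethnicity question, could be checked by cross-tabulating the free-text category against an ethnicity checkbox. Here is a minimal sketch, assuming a `category` column produced by `categorize_race_other`. The meaning of each `ethnicity_*` column is not stated in this notebook, so `ethnicity_2` is used only as a placeholder. The helper name `ethnicity_by_category` and the toy frame are hypothetical.

```python
import pandas as pd


def ethnicity_by_category(df: pd.DataFrame, ethnicity_col: str) -> pd.DataFrame:
    """Cross-tabulate the free-text category against one ethnicity checkbox (1 = selected)."""
    selected = (df[ethnicity_col] == 1).map({True: "selected", False: "not selected"})
    return pd.crosstab(df["category"], selected)


# Toy data for illustration only, not survey records
toy = pd.DataFrame({
    "category": ["Hispanic/Latino", "Hispanic/Latino", "South Asian", "Hispanic/Latino"],
    "ethnicity_2": [1, 1, 0, 0],
})
table = ethnicity_by_category(toy, "ethnicity_2")
print(table)
```

Running this once per ethnicity checkbox on `race_other_df` would show how many respondents in each write-in category also selected that ethnicity option.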

### What to Do Next

**Short-term:** Accept that the current categories don't serve everyone well. Use this analysis to inform data interpretation and reporting.

**Long-term:** Consider redesigning the race/ethnicity questions for future surveys:
- Test combined race/ethnicity questions vs. two separate questions  
- Add MENA category
- Disaggregate Asian categories
- Conduct cognitive interviews with diverse respondents about what question formats work best

**Most important:** Recognize that race/ethnicity are social constructs that don't map cleanly onto how everyone understands their identity. The "wrong" answers in free-text fields are actually valuable feedback about question design.