![POLITICO](https://rawgithub.com/The-Politico/src/master/images/logo/badge.png)

# POLITICO demographic district similarity maps

POLITICO demographic district similarity maps group congressional districts by how closely their demographic statistics from the U.S. Census match.

The maps are created by calculating a weighted Euclidean distance between districts based on their demographic characteristics.

#### Maps? I don't see any maps.

And you won't. Here, "maps" means lists of similar districts, one list for each district.

---

## The data

The demographic profile for each district is based on these characteristics:

- Non-Hispanic white percent of population (ACS table B03002)
- Non-Hispanic black percent of population (ACS table B03002)
- Non-Hispanic Asian percent of population (ACS table B03002)
- Hispanic percent of population (ACS table B03002)
- Median age (ACS table B01002)
- Median household income (ACS table B19013)
- Educational attainment, some college or greater (ACS table B15003)
- Population density (ACS table B01003 population divided by land area)

---

## Calculate weights

We weight demographic variables by how significant they are in shaping a district's political identity. To do that, we statistically measure the relationship between the variables and voting behavior using a linear model.

While our goal is to determine the similarity of districts, we actually use county-level data to calculate our weights. Because there are more counties than districts, counties give us a more robust picture of the relationship between demographics and voting.

Our voting data are the 2012 and 2016 presidential results. Our demographic data come from the 2012 and 2016 5-year American Community Survey estimates from the U.S. Census Bureau. (In the linear model, we weight recent data more heavily: each 2016 observation counts twice as much as a 2012 observation.)
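One way to read this weighting: in a weighted least-squares fit, giving an observation a weight of 2 is equivalent to listing it twice. The sketch below shows that with a single predictor and made-up numbers. The helper `weighted_slope` is hypothetical and is not part of the analysis code, which relies on scikit-learn instead.

```python
def weighted_slope(xs, ys, ws):
    """Slope of a one-variable weighted least-squares fit of y on x."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    numerator = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    denominator = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    return numerator / denominator


# Made-up observations: the first stands in for a 2016 county (weight 2),
# the other two for 2012 counties (weight 1).
xs = [0.2, 0.5, 0.9]
ys = [0.1, -0.2, -0.3]

slope_weighted = weighted_slope(xs, ys, [2, 1, 1])

# The same fit with the weight-2 observation simply listed twice.
slope_duplicated = weighted_slope([0.2] + xs, [0.1] + ys, [1, 1, 1, 1])

print(slope_weighted, slope_duplicated)
```

The two slopes agree up to floating-point rounding, which is why doubling the weight on 2016 rows pulls the fitted relationship toward the more recent election.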

To calculate the weights for demographics, we need to estimate how influential each variable is relative to the others in terms of its impact on party margin.

To begin, we normalize each demographic variable to a scale of 0 to 1, where 0 and 1 are the minimum and maximum of that variable's distribution. With every variable on the same scale, the model coefficients can be compared directly. We then fit the model and take the absolute value of each coefficient. The ratios between those absolute values represent how influential the variables are relative to each other when used together to predict voting behavior. These are our weights.
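The rescaling step can be sketched on its own as follows. The function name `min_max_normalize` and the sample ages are illustrative assumptions; the notebook's own helper does the same arithmetic inside `normalize_values`, without the guard for identical values.

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical median ages for four counties (illustrative numbers only).
median_ages = [30.0, 35.0, 40.0, 50.0]
normalized_ages = min_max_normalize(median_ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```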

For example, if non-Hispanic whiteness has a coefficient of -4 while income has a coefficient of 1 in the model, we weight whiteness 4 to 1 against income when calculating the Euclidean distance between districts.
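The arithmetic in this example is small enough to write out. The coefficients below are the hypothetical ones from the sentence above, not the fitted values, and the variable names are chosen for illustration.

```python
# Hypothetical coefficients from the example (not the fitted model).
example_coefficients = {"white": -4.0, "income": 1.0}

# The sign is discarded; only the size of each effect is kept.
example_weights = {name: abs(coef) for name, coef in example_coefficients.items()}

# How much more whiteness counts than income in the distance calculation.
ratio = example_weights["white"] / example_weights["income"]

print(example_weights)  # {'white': 4.0, 'income': 1.0}
print(ratio)            # 4.0
```

A coefficient of +4 would produce the same weight, since only the magnitude matters for similarity.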

### Get county census data



In [1]:
import os
from census import Census
from us import states

c = Census(os.getenv('CENSUS_API_KEY'))


def get_census_series(year):
    FIPS = {}
    
    def add_to_fips(d):
        if d["fips"] not in FIPS:
            FIPS[d["fips"]] = {}
        FIPS[d["fips"]] = {**FIPS[d["fips"]], **d}
    
    def normalize_values(label, collection):
        values = [d[label] for d in collection]
        max_value = max(values)
        min_value = min(values)
        for d in collection:
            d[label + "_norm"] = (d[label] - min_value) / (max_value - min_value)
            add_to_fips(d)
        
    
    # WHITE
    
    white = [{
        "fips": w["state"] + w["county"],
        "year": str(year),
        "white": w["B03002_003E"] / w["B03002_001E"]
    } for w in c.acs5.get(['B03002_003E', 'B03002_001E'], {'for': 'county:*'}, year=year)]
    
    normalize_values("white", white)
    
    # Black
    
    black = [{
        "fips": b["state"] + b["county"],
        "year": str(year),
        "black": b["B03002_004E"] / b["B03002_001E"]
    } for b in c.acs5.get(['B03002_004E', 'B03002_001E'], {'for': 'county:*'}, year=year)]
    
    normalize_values("black", black)
    
    
    # Hispanic
    
    hispanic = [{
        "fips": h["state"] + h["county"],
        "year": str(year),
        "hispanic": h["B03002_012E"] / h["B03002_001E"]
    } for h in c.acs5.get(['B03002_012E', 'B03002_001E'], {'for': 'county:*'}, year=year)]
    
    normalize_values("hispanic", hispanic)
    
    
    # Asian
    
    asian = [{
        "fips": h["state"] + h["county"],
        "year": str(year),
        "asian": h["B03002_006E"] / h["B03002_001E"]
    } for h in c.acs5.get(['B03002_006E', 'B03002_001E'], {'for': 'county:*'}, year=year)]
    
    normalize_values("asian", asian)
    

    
    # AGE
    
    age = [{
        "fips": a["state"] + a["county"],
        "year": str(year),
        "age": a["B01002_001E"]
    } for a in c.acs5.get('B01002_001E', {'for': 'county:*'}, year=year)]
    
    normalize_values("age", age)

    
    # INCOME
    
    income = [{
        "fips": i["state"] + i["county"],
        "year": str(year),
        "income": i["B19013_001E"]
    } for i in c.acs5.get('B19013_001E', {'for': 'county:*'}, year=year)]
    
    normalize_values("income", income)
    
    
    # EDUCATION

    education = [{
        "fips": e["state"] + e["county"],
        "year": str(year),
        "education": (
            e["B15003_019E"] +
            e["B15003_020E"] +
            e["B15003_021E"] +
            e["B15003_022E"] +
            e["B15003_023E"] +
            e["B15003_024E"] +
            e["B15003_025E"]
        ) / e["B15003_001E"],
    } for e in c.acs5.get([
        'B15003_001E',
        'B15003_019E',
        'B15003_020E',
        'B15003_021E',
        'B15003_022E',
        'B15003_023E',
        'B15003_024E',
        'B15003_025E'
    ], {'for': 'county:*'}, year=year)]
    
    normalize_values("education", education)
    
    
    # DENSITY

    POP = {}
    for d in c.acs5.get(['B01003_001E'], {'for': 'county:*'}, year=year):
        fips = d["state"] + d["county"]
        POP[fips] = int(d["B01003_001E"])
    
    density = [
        {
            "fips": d["state"] + d["county"],
            "density": POP[d["state"] + d["county"]] / int(d["AREALAND"])
        } for d in c.sf1.get('AREALAND', {'for': 'county:*'}, year=2010) if d["state"] + d["county"] in POP
    ]
    
    normalize_values("density", density)
    
    return FIPS


CENSUS_2016 = get_census_series(2016)
CENSUS_2012 = get_census_series(2012)

### Get county election data

In [2]:
import requests
import io
import csv


def get_results_series(year):
    response = requests.get('https://raw.githubusercontent.com/The-Politico/presidential-county-data/master/output/{}.csv'.format(year))

    results = {}
    reader = csv.DictReader(io.StringIO(response.text))
    for row in reader:
        fips = row['county_fips']
        results[fips] = {
            "dem": int(row['democrat']),
            "dem_pct": int(row["democrat"]) / int(row["total"]),
            "gop": int(row['republican']),
            "gop_pct": int(row["republican"]) / int(row["total"]),
            "total": int(row['total']),
            "margin": (int(row["democrat"]) / int(row["total"])) - (int(row["republican"]) / int(row["total"]))
        }
    return results

RESULTS_2012 = get_results_series(2012)
RESULTS_2016 = get_results_series(2016)
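To make the `margin` field concrete, here is a small sketch that runs the same calculation on two made-up rows. The column names match the ones the loader reads; the vote counts are invented. The margin is written as a single division, which is algebraically the same as the difference of the two vote shares.

```python
import csv
import io

# Made-up rows using the same column names the loader above reads.
sample_csv = (
    "county_fips,democrat,republican,total\n"
    "01001,300,600,1000\n"
    "01003,550,400,1000\n"
)

margins = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    total = int(row["total"])
    # Positive margin: Democrat ahead. Negative margin: Republican ahead.
    margins[row["county_fips"]] = (int(row["democrat"]) - int(row["republican"])) / total

print(margins)  # {'01001': -0.3, '01003': 0.15}
```

Because `total` can include votes for other candidates, the two major-party shares need not sum to 1.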


### Fit the model

In [3]:
from sklearn.linear_model import LinearRegression
from IPython.display import display

X = []
y = []
w = []


def add_to_model(DEMOGRAPHICS, VOTING, WEIGHT):
    for fips, demos in DEMOGRAPHICS.items():
        if fips not in VOTING:
            continue
        y.append(VOTING[fips]["margin"])
        X.append([
            demos["white_norm"],
            demos["black_norm"],
            demos["hispanic_norm"],
            demos["asian_norm"],
            demos["age_norm"],
            demos["income_norm"],
            demos["education_norm"],
            demos["density_norm"],
        ])
        w.append(WEIGHT)

add_to_model(CENSUS_2016, RESULTS_2016, 2)
add_to_model(CENSUS_2012, RESULTS_2012, 1)

# Weighted fit: 2016 rows carry weight 2 and 2012 rows weight 1 (set above).
model = LinearRegression().fit(X, y, sample_weight=w)



### R<sup>2</sup>

In [4]:
display(model.score(X, y, w))

0.4548465203054187

### Coefficients

In [5]:
coefficients = {
    'white': model.coef_[0],
    'black': model.coef_[1],
    'hispanic': model.coef_[2],
    'asian': model.coef_[3],
    'age': model.coef_[4],
    'income': model.coef_[5],
    'education': model.coef_[6],
    'density': model.coef_[7]
}
display(coefficients)

{'white': -0.6925013874350867,
 'black': 0.356000358828075,
 'hispanic': -0.24205959230946833,
 'asian': 0.9221360218062666,
 'age': 0.11822108301928899,
 'income': -0.18014078676637332,
 'education': 0.645780266460257,
 'density': 1.0368261642579768}

### Weights

We use the absolute value of the coefficients as our weights because we don't care about the _direction_ of the effect, just the _size_ of it relative to the other variables.

In [6]:
WEIGHTS = {
    "white": abs(coefficients["white"]),
    "black": abs(coefficients["black"]),
    "hispanic": abs(coefficients["hispanic"]),
    "asian": abs(coefficients["asian"]),
    "age": abs(coefficients["age"]),
    "income": abs(coefficients["income"]),
    "education": abs(coefficients["education"]),
    "density": abs(coefficients["density"])
}
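To see which variables end up mattering most, the printed coefficients can be ranked by absolute value. This sketch copies the values from the output above, rounded to four decimals, so it runs without refitting the model. The names `printed_coefficients` and `ranked` are illustrative.

```python
# Coefficients copied (rounded to four decimals) from the output above.
printed_coefficients = {
    "white": -0.6925,
    "black": 0.3560,
    "hispanic": -0.2421,
    "asian": 0.9221,
    "age": 0.1182,
    "income": -0.1801,
    "education": 0.6458,
    "density": 1.0368,
}

# Rank variables from most to least influential by absolute coefficient.
ranked = sorted(printed_coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coef in ranked:
    print("{:<10} {:.4f}".format(name, abs(coef)))
```

With these values, density carries the largest weight and median age the smallest.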

---

## Calculate district similarity

### Get district census data

In [7]:
import us

def get_district_census_series(year):
    DISTRICTS = {}
    
    def add_to_fips(d):
        if d["district"] not in DISTRICTS:
            DISTRICTS[d["district"]] = {}
        DISTRICTS[d["district"]] = {**DISTRICTS[d["district"]], **d}
    
    def normalize_values(label, collection):
        values = [d[label] for d in collection]
        max_value = max(values)
        min_value = min(values)
        for d in collection:
            d[label + "_norm"] = (d[label] - min_value) / (max_value - min_value)
            add_to_fips(d)
    
    def district_id(d):
        state_postal = us.states.lookup(d["state"]).abbr
        district = d["congressional district"]
        return "{}-{}".format(state_postal, district)
        
    
    # WHITE
    
    white = [{
        "district": district_id(w),
        "year": str(year),
        "white": w["B03002_003E"] / w["B03002_001E"]
    } for w in c.acs5.get(['B03002_003E', 'B03002_001E'], {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("white", white)
    
    
    # Black
    
    black = [{
        "district": district_id(b),
        "year": str(year),
        "black": b["B03002_004E"] / b["B03002_001E"]
    } for b in c.acs5.get(['B03002_004E', 'B03002_001E'], {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("black", black)
    
    
    # Hispanic
    
    hispanic = [{
        "district": district_id(h),
        "year": str(year),
        "hispanic": h["B03002_012E"] / h["B03002_001E"]
    } for h in c.acs5.get(['B03002_012E', 'B03002_001E'], {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("hispanic", hispanic)
    
    
    # Asian
    
    asian = [{
        "district": district_id(h),
        "year": str(year),
        "asian": h["B03002_006E"] / h["B03002_001E"]
    } for h in c.acs5.get(['B03002_006E', 'B03002_001E'], {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("asian", asian)



    
    # AGE
    
    age = [{
        "district": district_id(a),
        "year": str(year),
        "age": a["B01002_001E"]
    } for a in c.acs5.get('B01002_001E', {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("age", age)

    
    # INCOME
    
    income = [{
        "district": district_id(i),
        "year": str(year),
        "income": i["B19013_001E"]
    } for i in c.acs5.get('B19013_001E', {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("income", income)
    
    
    # EDUCATION

    education = [{
        "district": district_id(e),
        "year": str(year),
        "education": (
            e["B15003_019E"] +
            e["B15003_020E"] +
            e["B15003_021E"] +
            e["B15003_022E"] +
            e["B15003_023E"] +
            e["B15003_024E"] +
            e["B15003_025E"]
        ) / e["B15003_001E"],
    } for e in c.acs5.get([
        'B15003_001E',
        'B15003_019E',
        'B15003_020E',
        'B15003_021E',
        'B15003_022E',
        'B15003_023E',
        'B15003_024E',
        'B15003_025E'
    ], {'for': 'congressional district:*'}, year=year)]
    
    normalize_values("education", education)
    
    
    # DENSITY

    POP = {}
    
    for d in c.acs5.get(['B01003_001E'], {'for': 'congressional district:*'}, year=year):
        district = district_id(d)
        POP[district] = int(d["B01003_001E"])
    
    response = requests.get('https://www2.census.gov/geo/relfiles/cdsld16/natl/natl_landarea_cd_delim.txt')
    text = "\n".join(response.text.split("\n")[1:])
    
    # Manually add at-large districts which aren't included in the geo data
    # We get land area from here:
    # https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml
    text += "02,00,570640.95\n" # Alaska
    text += "10,00,1948.54\n" # Delaware
    text += "30,00,145545.80\n" # Montana
    text += "38,00,69000.80\n" # North Dakota
    text += "46,00,75811.00\n" # South Dakota
    text += "50,00,9216.66\n" # Vermont
    text += "56,00,97093.14\n" # Wyoming
    
    reader = csv.DictReader(io.StringIO(text))
    
    def alt_id(d):
        return "{}-{}".format(
            us.states.lookup(d["State"]).abbr,
            d["Congressional District"]
        )
    
    density = [
        {
            "district": alt_id(d),
            "density": POP[alt_id(d)] / float(d["Land Area"])
        } for d in reader
    ]
    
    normalize_values("density", density)
    
    return DISTRICTS


CENSUS_DISTRICTS = get_district_census_series(2016)

### Weighted Euclidean distance

In [8]:
from scipy.spatial import distance

def get_distance(districtDemos, comparitorDemos):
    w = [
        WEIGHTS["white"],
        WEIGHTS["black"],
        WEIGHTS["hispanic"],
        WEIGHTS["asian"],
        WEIGHTS["age"],
        WEIGHTS["income"],
        WEIGHTS["education"],
        WEIGHTS["density"]
    ]
    district = [
        districtDemos["white_norm"],
        districtDemos["black_norm"],
        districtDemos["hispanic_norm"],
        districtDemos["asian_norm"],
        districtDemos["age_norm"],
        districtDemos["income_norm"],
        districtDemos["education_norm"],
        districtDemos["density_norm"],
    ]
    comparitor = [
        comparitorDemos["white_norm"],
        comparitorDemos["black_norm"],
        comparitorDemos["hispanic_norm"],
        comparitorDemos["asian_norm"],
        comparitorDemos["age_norm"],
        comparitorDemos["income_norm"],
        comparitorDemos["education_norm"],
        comparitorDemos["density_norm"],
    ]
    return distance.euclidean(district, comparitor, w)





district_distances = {}

exclude = ['DC-98', 'PR-98']

districts = [d for d in CENSUS_DISTRICTS.keys() if d not in exclude]

for district, districtDemos in CENSUS_DISTRICTS.items():
    if district in exclude:
        continue
    district_distances[district] = []
    for comparitor, comparitorDemos in CENSUS_DISTRICTS.items():
        if district == comparitor or comparitor in exclude:
            continue
        district_distances[district].append({
            'district': comparitor,
            'distance': get_distance(districtDemos, comparitorDemos)
        })
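For readers who want to check the distance by hand, here is a dependency-free sketch. It assumes the weights multiply each squared difference before summing, which is our understanding of how SciPy's `distance.euclidean` treats its `w` argument. The function `weighted_euclidean` and the two profiles are made up for illustration.

```python
import math


def weighted_euclidean(u, v, w):
    """sqrt(sum_i w[i] * (u[i] - v[i]) ** 2) for equal-length sequences."""
    return math.sqrt(sum(wi * (ui - vi) ** 2 for ui, vi, wi in zip(u, v, w)))


# Made-up normalized profiles with two variables each.
profile_a = [0.75, 0.25]
profile_b = [0.50, 0.50]

# Equal weights give the ordinary Euclidean distance.
unweighted = weighted_euclidean(profile_a, profile_b, [1.0, 1.0])

# Weighting the first variable 4 to 1 stretches the distance along it.
weighted = weighted_euclidean(profile_a, profile_b, [4.0, 1.0])

print(unweighted, weighted)
```

Both profiles differ by 0.25 on each variable, so the larger weighted result comes entirely from the heavier weight on the first variable.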

For each district, we keep the **22** most similar districts (~5% of all districts).

In [9]:
N = 22

In [10]:
similar_district_ids = {}
similar_districts = {}

for district, distances in district_distances.items():
    sorted_distances = sorted(distances, key=lambda k: k['distance'])
    similar_district_ids[district] = [k['district'] for k in sorted_distances[:N]]
    similar_districts[district] = sorted_distances[:N]
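The selection step is a plain "smallest N by distance" query. The sketch below runs it on made-up data two ways: by sorting and slicing, as the cell above does, and with `heapq.nsmallest` from the standard library as an alternative. The district IDs and distances are placeholders.

```python
import heapq

# Made-up distances from one district to four others.
toy_distances = [
    {"district": "XX-01", "distance": 0.42},
    {"district": "XX-02", "distance": 0.07},
    {"district": "XX-03", "distance": 0.19},
    {"district": "XX-04", "distance": 0.88},
]

TOP_N = 2  # the notebook uses N = 22

# Sort ascending by distance and slice, as the cell above does.
by_sort = sorted(toy_distances, key=lambda k: k["distance"])[:TOP_N]

# heapq.nsmallest returns the same entries without sorting the full list.
by_heap = heapq.nsmallest(TOP_N, toy_distances, key=lambda k: k["distance"])

print([k["district"] for k in by_sort])  # ['XX-02', 'XX-03']
```

With only a few hundred districts, either approach is fast; sorting has the advantage of keeping the full ranking around if it is needed later.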

---

## Output

JSON

In [11]:
import json

with open('data/demographic-similarity.json', 'w') as file:
    json.dump(similar_district_ids, file)

CSV with similarity score stats

In [12]:
with open('data/demographic-similarity.csv','w') as file:
    writer = csv.writer(file)
    writer.writerow(['district', 'min_similarity', 'max_similarity', 'similarity_range', 'most_similar ⬅'])

    for district in districts:
        MIN = similar_districts[district][0]['distance']
        MAX = similar_districts[district][-1]['distance']
        row = [district, MIN, MAX, MAX - MIN] + [k['district'] for k in similar_districts[district]]
        
        writer.writerow(row)
        
        print(row[0], row[4:14])

AL-01 ['SC-05', 'MS-01', 'AL-02', 'AL-03', 'GA-10', 'SC-07', 'MS-04', 'GA-03', 'NC-06', 'MO-05']
AL-02 ['AL-01', 'GA-08', 'SC-05', 'SC-07', 'MS-01', 'LA-04', 'GA-10', 'AL-03', 'MS-03', 'GA-12']
AL-03 ['MS-01', 'SC-05', 'GA-10', 'AL-01', 'MS-04', 'AL-02', 'SC-07', 'GA-03', 'LA-03', 'NC-06']
AL-04 ['KY-01', 'PA-17', 'OH-04', 'TN-06', 'VA-09', 'KY-02', 'IN-02', 'MO-08', 'OH-13', 'PA-11']
AL-05 ['OH-10', 'IL-12', 'SC-04', 'FL-01', 'LA-01', 'TN-08', 'NC-03', 'FL-03', 'OH-01', 'AR-02']
AL-06 ['FL-01', 'OH-10', 'MI-03', 'NY-20', 'NY-25', 'AL-05', 'MI-09', 'MI-12', 'IL-13', 'NY-24']
AL-07 ['MS-02', 'LA-02', 'SC-06', 'MI-13', 'TN-09', 'GA-02', 'OH-11', 'IL-02', 'FL-05', 'GA-13']
AK-00 ['WA-10', 'WA-08', 'TX-12', 'MA-03', 'TX-25', 'NY-18', 'WA-02', 'TX-31', 'CO-06', 'IL-10']
AZ-01 ['CA-23', 'CA-08', 'OK-05', 'NM-03', 'CA-50', 'TX-17', 'TX-19', 'IL-03', 'WA-04', 'NV-04']
AZ-02 ['FL-07', 'TX-21', 'AZ-08', 'CO-01', 'TX-25', 'AZ-09', 'CO-07', 'CA-24', 'NV-02', 'CO-03']
AZ-03 ['TX-35', 'TX-23', 'CA-4

### Compare demographic profiles of top 5 similar districts

See how closely the demographic profiles of the most similar districts line up.

In [13]:
def get_measures(district):
    return [
        CENSUS_DISTRICTS[district]['white_norm'],
        CENSUS_DISTRICTS[district]['black_norm'],
        CENSUS_DISTRICTS[district]['hispanic_norm'],
        CENSUS_DISTRICTS[district]['asian_norm'],
        CENSUS_DISTRICTS[district]['age_norm'],
        CENSUS_DISTRICTS[district]['income_norm'],
        CENSUS_DISTRICTS[district]['education_norm'],
        CENSUS_DISTRICTS[district]['density_norm']
    ]
    

with open('data/demographic-similarity-top-5.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerow([
        'district',
        'white',
        'black',
        'hispanic',
        'asian',
        'age',
        'income',
        'education',
        'density'
    ])
    
    for district in districts:
        district_measures = get_measures(district)
        similars = []
        writer.writerow([])
        writer.writerow([district] + district_measures)
        top5 = similar_districts[district][:5]
        for d in top5:
            similar_measures = get_measures(d['district'])
            similars.append(similar_measures)
            writer.writerow([d['district']] + similar_measures)
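        # Sum of signed differences between the district and its top 5, per measure.
        # (Labeled VARIANCE in the output; it is not a statistical variance.)
        variances = ['VARIANCE',]
        # enumerate avoids list.index(), which picks the wrong column when two measures are equal.
        for i, m in enumerate(district_measures):
            variance = 0
            for s in similars:
                variance += (m - s[i])
            variances.append(variance)
        writer.writerow(variances)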

### Compare similarity maps to POLITICO race ratings

For each district, these maps list the race ratings of its most similar districts. We also calculate the population variance of those ratings for each map, using each rating's position on an ordered point scale.
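As a rough illustration of what that variance captures, the sketch below scores two made-up lists of ratings. The point scale in `toy_rating_codes` is an assumption for this example only; the real codes come from each rating's `order` field in the ratings feed.

```python
from statistics import pvariance

# Hypothetical point scale; the real codes come from each rating's "order" field.
toy_rating_codes = {
    "Solid-D": 1, "Likely-D": 2, "Lean-D": 3, "Toss-Up": 4,
    "Lean-R": 5, "Likely-R": 6, "Solid-R": 7,
}

uniform = ["Solid-R", "Solid-R", "Solid-R", "Solid-R"]
mixed = ["Solid-D", "Solid-R", "Solid-D", "Solid-R"]

# Identical ratings have no spread, so the variance is zero.
uniform_variance = pvariance([toy_rating_codes[r] for r in uniform])

# Ratings split between the two extremes produce a large variance.
mixed_variance = pvariance([toy_rating_codes[r] for r in mixed])

print(uniform_variance, mixed_variance)
```

A low variance suggests that demographically similar districts are rated alike; a high one flags a district whose lookalikes are politically split.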

In [14]:
from statistics import pvariance

response = requests.get('https://www.politico.com/election-results/2018/race-ratings/data/ratings.json')
ratings = {}
rating_codes = {}

for rating in response.json():
    ratings[rating['id']] = rating['latest_rating']['short_label']
    rating_codes[ratings[rating['id']]] = rating['latest_rating']['order']

    
with open('data/demographic-similarity-with-ratings.csv','w') as file:
    writer = csv.writer(file)
    writer.writerow(['district', 'rating', 'variance', 'most_similar_ratings'])
    
    for district in districts:
        row = [
            district,
            ratings[district],
            pvariance([rating_codes[ratings[k['district']]] for k in similar_districts[district]])
        ] + [ratings[k['district']] for k in similar_districts[district]]
        
        writer.writerow(row)
        
        print(row[0:2], row[3:8])

['AL-01', 'Solid-R'] ['Solid-R', 'Solid-R', 'Solid-R', 'Solid-R', 'Solid-R']
['AL-02', 'Solid-R'] ['Solid-R', 'Solid-R', 'Solid-R', 'Solid-R', 'Solid-R']
['AL-03', 'Solid-R'] ['Solid-R', 'Solid-R', 'Solid-R', 'Solid-R', 'Solid-R']
['AL-04', 'Solid-R'] ['Solid-R', 'Likely-D', 'Solid-R', 'Solid-R', 'Solid-R']
['AL-05', 'Solid-R'] ['Likely-R', 'Lean-R', 'Solid-R', 'Solid-R', 'Solid-R']
['AL-06', 'Solid-R'] ['Solid-R', 'Likely-R', 'Solid-R', 'Solid-D', 'Solid-D']
['AL-07', 'Solid-D'] ['Solid-D', 'Solid-D', 'Solid-D', 'Solid-D', 'Solid-D']
['AK-00', 'Likely-R'] ['Solid-D', 'Lean-D', 'Solid-R', 'Solid-D', 'Solid-R']
['AZ-01', 'Lean-D'] ['Solid-R', 'Solid-R', 'Likely-R', 'Solid-D', 'Lean-R']
['AZ-02', 'Toss-Up'] ['Likely-D', 'Likely-R', 'Likely-R', 'Solid-D', 'Solid-R']
['AZ-03', 'Solid-D'] ['Solid-D', 'Lean-R', 'Solid-D', 'Lean-R', 'Solid-D']
['AZ-04', 'Solid-R'] ['Solid-R', 'Likely-R', 'Solid-D', 'Solid-R', 'Likely-R']
['AZ-05', 'Solid-R'] ['Solid-D', 'Likely-R', 'Toss-Up', 'Solid-D', 'Soli