# Political Boundaries and Simulations

## Zoë Farmer

# What are we talking about?

In this presentation I'll discuss two numerical approaches to drawing political boundaries while striving to avoid gerrymandering.

Slides, code, and images are available here if you want to follow along.

http://bit.ly/2xvaHnX

or

http://dataleek.io/presentations/politicalboundaries

# Who am I?

* My name is Zoë Farmer
* I'm a recent CU graduate with a BS in Applied Math and a CS Minor
* I'm a co-coordinator of the Boulder Python Meetup
* I'm a big fan of open source software
* I'm looking for work (pls hire me)
* http://www.dataleek.io
* [@thedataleek](http://www.twitter.com/thedataleek)
* [git(hub|lab).com/thedataleek](http://github.com/thedataleek)

## Our Algorithms

* [Simulated Annealing](https://en.wikipedia.org/wiki/Simulated_annealing)
    * Make minor perturbations to a given solution until we find one that's slightly better, then repeat.
* [Genetic Algorithm](https://en.wikipedia.org/wiki/Genetic_algorithm)
    * Create a ton of random solutions, have them "combine" and take the best of their children.

# Simulated Annealing

1. Generate a random solution
2. Generate a "neighboring solution" to our generated solution
3. Keep whichever is better, or (with decaying probability) take the new one regardless
4. Go back to 2
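The loop above can be sketched in a few lines. This is a minimal sketch, not the project's code: `neighbor` and `fitness` are hypothetical callables you would supply, and the `exp(delta / temperature)` acceptance rule with geometric cooling is the common textbook choice, assumed here for illustration.

```python
import math
import random


def simulated_annealing(initial, neighbor, fitness, steps=1000,
                        t0=1.0, cooling=0.995, rng=None):
    """Maximize `fitness`, starting from `initial`.

    `neighbor(solution, rng)` returns a nearby solution. The temperature
    schedule (t0, cooling) is an illustrative assumption.
    """
    rng = rng or random.Random()
    current = best = initial
    current_fit = best_fit = fitness(initial)
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_fit = fitness(candidate)
        delta = candidate_fit - current_fit
        # Always keep improvements; keep worse solutions with a
        # probability that shrinks as the temperature decays.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = candidate, candidate_fit
        if current_fit > best_fit:
            best, best_fit = current, current_fit
        temperature = max(temperature * cooling, 1e-12)
    return best, best_fit
```

For example, with integers as "solutions", `simulated_annealing(0, lambda x, r: x + r.choice([-1, 1]), lambda x: -(x - 3) ** 2)` walks toward `3`.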

# Genetic Algorithm

1. Randomly generate an initial population of solutions
2. Use our solution population to generate some large number of children (note,
   these children should inherit properties from their parents)
3. Keep the best of our total population
4. Go back to 2
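As a rough sketch of that loop (again, not the project's code), assume a hypothetical `combine(a, b, rng)` that produces a child from two parents and a `fitness` function to rank solutions. The population size stays fixed, and parents compete with their children for survival.

```python
import random


def genetic_algorithm(population, combine, fitness, generations=50,
                      children_per_generation=100, rng=None):
    """Return the fittest solution found.

    `population` needs at least two members. The generation and
    children counts are illustrative defaults.
    """
    rng = rng or random.Random()
    population = list(population)
    size = len(population)
    for _ in range(generations):
        children = []
        for _ in range(children_per_generation):
            # Pick two distinct parents and let them "combine".
            parent_a, parent_b = rng.sample(population, 2)
            children.append(combine(parent_a, parent_b, rng))
        # Keep the best of the total population (parents and children).
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:size]
    return population[0]
```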

# Drawing Political District Boundaries

How can this be applied to political boundaries?

Assumptions:

* 2 parties
* Rectangular areas
* Provided in a specific format

```
D R D R D R R R
D D R D R R R R
D D D R R R R R
D D R R R R D R
R R D D D R R R
R D D D D D R R
R R R D D D D D
D D D D D D R D
```

Which can be plotted for readability.

<img src="./img/smallState_initial.png" style="width: 50%">
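Reading that format into memory is simple. Here's a minimal sketch, assuming the file holds whitespace-separated `R`/`D` cells with one row per line; `parse_state` is a name I made up for this example.

```python
def parse_state(text):
    """Parse rows of whitespace-separated 'R'/'D' cells into a 2D list."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    if not rows:
        raise ValueError("no rows found")
    width = len(rows[0])
    for row in rows:
        # Every row must be the same length and contain only R or D.
        if len(row) != width or any(cell not in ("R", "D") for cell in row):
            raise ValueError("expected a rectangular grid of R and D")
    return rows
```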

# Related Problems to Solve First

## Finding Neighbors of a Point

Our first big problem is how we find neighbors of a single point. For any `(y, x)` pair we can express its neighbors using the following algorithm.

1. Iterate offsets `yi` and `xi` over `range(-1, 2)`
2. For each pair of offsets, accept `(y + yi, x + xi)` if the following conditions hold:
    * `y + yi` is within the vertical bounds of the field
    * `x + xi` is within the horizontal bounds of the field
    * `xi` and `yi` are not both equal to zero
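Those steps translate almost directly into code. A minimal sketch, where `height` and `width` are assumed to be the field's dimensions:

```python
def point_neighbors(y, x, height, width):
    """Return the in-bounds neighbors (including diagonals) of (y, x)."""
    neighbors = []
    for yi in range(-1, 2):
        for xi in range(-1, 2):
            if yi == 0 and xi == 0:
                continue  # skip the point itself
            ny, nx = y + yi, x + xi
            if 0 <= ny < height and 0 <= nx < width:
                neighbors.append((ny, nx))
    return neighbors
```

A corner point such as `(0, 0)` has three neighbors, while an interior point has eight.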

## Determining if a District is Valid

How do we determine valid solutions?

* Think of these as [single connected components](https://en.wikipedia.org/wiki/Connected_component_(graph_theory))
* We can check for this with [connected component labelling](https://en.wikipedia.org/wiki/Connected-component_labeling).
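For a single district, a flood fill is enough to answer the question. This sketch assumes the solution is a 2D list of district labels and that diagonal neighbors count as connected (matching the neighbor definition above); `is_single_component` is a hypothetical helper name.

```python
from collections import deque


def is_single_component(grid, district):
    """True if every cell labelled `district` is reachable from any other."""
    height, width = len(grid), len(grid[0])
    cells = {(y, x) for y in range(height) for x in range(width)
             if grid[y][x] == district}
    if not cells:
        return False  # an empty district is treated as invalid here
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        y, x = queue.popleft()
        for yi in range(-1, 2):
            for xi in range(-1, 2):
                nxt = (y + yi, x + xi)
                # Only walk through cells that belong to this district.
                if nxt in cells and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen == cells
```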

## Finding District Neighbors

* Need to find all neighbors of a given district
* Super similar to connected component labelling

The basic algorithm is as follows.

1. Get a random spot inside the given district
2. Add this spot to a Queue
3. Initialize an empty labelling array (as with connected component labelling)
4. While the queue is not empty, get a new `(y, x)` pair from it.
5. If the point falls within the district, get all of the point's neighbors, add them to the queue, and go back to (4)
6. If the point does not fall into the district, add it to the list of district neighbors.
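Here's how those steps might look in code. It's a sketch under a few assumptions: the solution is a 2D list of district labels, the district is already one connected component, and the "labelling array" is a simple visited grid.

```python
import random
from collections import deque


def district_neighbors(grid, district, rng=None):
    """Return the cells just outside `district` that touch it."""
    rng = rng or random.Random()
    height, width = len(grid), len(grid[0])
    inside = [(y, x) for y in range(height) for x in range(width)
              if grid[y][x] == district]
    if not inside:
        return []
    start = rng.choice(inside)  # a random spot inside the district
    visited = [[False] * width for _ in range(height)]
    visited[start[0]][start[1]] = True
    queue = deque([start])
    border = []
    while queue:
        y, x = queue.popleft()
        if grid[y][x] != district:
            border.append((y, x))  # outside the district: a neighbor
            continue
        for yi in range(-1, 2):
            for xi in range(-1, 2):
                ny, nx = y + yi, x + xi
                if (0 <= ny < height and 0 <= nx < width
                        and not visited[ny][nx]):
                    visited[ny][nx] = True
                    queue.append((ny, nx))
    return border
```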

## What is a Fitness Function?

Both algorithms compare solutions by their "value", which in this case is determined with a fitness function.

From wikipedia:

> A fitness function is a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims.

TL;DR a single number that basically tells us how "good" of a solution we have.

Taking a step back from the code and considering the real world, let's think about what we'd ideally like to emphasize in a political districting system.

* We want districts to be homogeneous.
* We want our district ratios to approximately match our population ratios.
* We want to avoid [gerrymandering](https://en.wikipedia.org/wiki/Gerrymandering).
* We want all districts to be around the same population size.

<img src="./img/gerrymandering_example.jpg" style="width: 400px; height: 300px;"/>

Translated to our code these priorities become

1. Validity of solution
2. Make sure the ratio of `R` to `D` majority districts matches the ratio of `R` to `D` in the general population.
3. Make sure each district is as homogeneous as possible
4. Reduce the value of the district if its size isn't close to the "ideal size", which is `total_size / num_districts`.
5. We also take into account that in non-homogeneous districts, voters who aren't affiliated with the majority party might be swayed by targeted campaigns. To reflect this, we count each non-affiliated "zone" with a weight of -0.9 instead of -1.
6. Finally, we can also minimize edge length while trying to keep each district the same size. This will hopefully result in ideal districts.
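To make a few of these priorities concrete, here's a toy fitness function. It is only a sketch: the way the terms are combined and scaled is my assumption for illustration, not the formula used in the repository, and it leaves out the validity check and the edge-length term. Only the -0.9 weight and the `total_size / num_districts` ideal size come from the list above.

```python
def fitness(districts, parties, num_districts):
    """Score a districting; higher is better.

    districts[y][x] is a label in 1..num_districts and parties[y][x]
    is "R" or "D". The scaling of each term is illustrative.
    """
    total = sum(len(row) for row in parties)
    ideal_size = total / num_districts
    counts = {d: {"R": 0, "D": 0} for d in range(1, num_districts + 1)}
    for district_row, party_row in zip(districts, parties):
        for district, party in zip(district_row, party_row):
            counts[district][party] += 1

    score = 0.0
    r_majority = 0
    for tally in counts.values():
        size = tally["R"] + tally["D"]
        if size == 0:
            return float("-inf")  # a district with no cells is invalid
        majority = max(tally.values())
        minority = min(tally.values())
        # Homogeneity: +1 per majority cell, -0.9 per minority cell.
        score += (majority - 0.9 * minority) / size
        # Penalize distance from the ideal district size.
        score -= abs(size - ideal_size) / ideal_size
        if tally["R"] > tally["D"]:
            r_majority += 1

    # Penalize a mismatch between district ratio and population ratio.
    r_population = sum(t["R"] for t in counts.values()) / total
    score -= abs(r_majority / num_districts - r_population) * num_districts
    return score
```

On a 2x2 field with an `R` row and a `D` row, splitting by row scores higher than splitting by column, since the row split gives two homogeneous districts whose majorities match the population.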

## Generating Random Solutions

This algorithm is very straightforward.

1. Generate a number of "spawn points" equal to the number of districts.
2. Fill.

The fill algorithm is also straightforward.

1. Set a list of available districts.
2. While there are any non-set points, pick a random district, `i`, from the list of available districts.
3. Get a list of all neighbors of the district, but filter to only 0-valued entries.
4. If no such neighbors exist, remove this district from the list of available districts.
5. Otherwise pick a neighbor at random and set it to `i`.
6. Loop back to (2).

<img src="./img/generate_random_solution_smallState.gif" style="width: 50%;">
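The spawn-and-fill procedure can be sketched as follows. This assumes districts are labelled `1..num_districts`, unset points are `0`, and diagonal neighbors count; the function name and signature are made up for this example.

```python
import random


def generate_random_solution(height, width, num_districts, rng=None):
    """Return a 2D list where every cell holds a district label."""
    rng = rng or random.Random()
    grid = [[0] * width for _ in range(height)]
    all_cells = [(y, x) for y in range(height) for x in range(width)]
    cells = {}  # district label -> list of its cells
    # 1. One random "spawn point" per district.
    spawns = rng.sample(all_cells, num_districts)
    for district, (y, x) in enumerate(spawns, start=1):
        grid[y][x] = district
        cells[district] = [(y, x)]
    # 2. Fill.
    available = list(range(1, num_districts + 1))
    unset = height * width - num_districts
    while unset > 0 and available:
        i = rng.choice(available)
        open_neighbors = sorted({
            (y + yi, x + xi)
            for (y, x) in cells[i]
            for yi in range(-1, 2)
            for xi in range(-1, 2)
            if 0 <= y + yi < height and 0 <= x + xi < width
            and grid[y + yi][x + xi] == 0
        })
        if not open_neighbors:
            available.remove(i)  # this district can no longer grow
            continue
        y, x = rng.choice(open_neighbors)
        grid[y][x] = i
        cells[i].append((y, x))
        unset -= 1
    return grid
```

Because each district only ever grows into cells adjacent to itself, every district in the result is a single connected component.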

# Simulated Annealing

Recall:

1. Generate a random solution
2. Generate a solution neighbor
3. If the new solution is better than the old, set the current solution to the new one.
4. Sometimes accept a worse solution

## How much does choice of `k` impact solution selection?

<img src="./img/kvals.png" style="width: 50%;">

The entire process looks like this:

<img src="./img/simulated_annealing_solution_smallState.gif" style="width: 50%">

Which has the following final solution.

<img src="./img/simulated_annealing_solution_smallState.png" style="width: 50%">

## Mutations

Simulated Annealing relies on "mutating" solutions via the following algorithm.

1. Find all district neighbors
2. Pick a neighboring point at random.
3. If the neighboring point's district has at least size 2, set this neighboring point to our district.
4. Otherwise, pick a different neighboring point.

Which can be visualized as follows.

<img src="./img/mutation.png" style="width: 50%">
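A sketch of that mutation, assuming the solution is a 2D list of district labels. It follows only the steps listed above, so it does not re-check that the district losing a point stays connected; `mutate` here takes the district to grow as an argument, which is my choice for the example.

```python
import random


def mutate(grid, district, rng=None):
    """Return a copy of `grid` where `district` has claimed one neighbor."""
    rng = rng or random.Random()
    height, width = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    sizes = {}
    for row in grid:
        for label in row:
            sizes[label] = sizes.get(label, 0) + 1
    # 1. Find all points outside the district that touch it.
    border = set()
    for y in range(height):
        for x in range(width):
            if grid[y][x] != district:
                continue
            for yi in range(-1, 2):
                for xi in range(-1, 2):
                    ny, nx = y + yi, x + xi
                    if (0 <= ny < height and 0 <= nx < width
                            and grid[ny][nx] != district):
                        border.add((ny, nx))
    # 2-4. Try neighboring points in random order until one is usable.
    candidates = sorted(border)
    rng.shuffle(candidates)
    for y, x in candidates:
        if sizes[grid[y][x]] >= 2:  # never empty out another district
            new_grid[y][x] = district
            break
    return new_grid
```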

# Genetic Algorithms

Just as simulated annealing relies on `mutate()` to home in on a good solution, the genetic algorithm relies on `combine()` to take two solutions and generate a "child" solution. The `combine()` algorithm is as follows.

1. Shuffle our two parents in an array.
2. Shuffle a list of districts.
3. Set a cursor that points to the first parent in the array.
4. Iterate through our districts with variable `i`
5. For the current district, find all points of the parent that our cursor is pointing to.
6. Get all "open" (i.e. set to 0) points for our child solution
7. Build a new bitmask from every point that appears in both of these sets.
8. If this bitmask is valid (i.e. one connected component), set all of its points in the child solution to our current district.
9. Otherwise, make the district valid and then set its points in the child solution.
10. Flip the cursor to the other parent and continue with the next district.

Making a district valid is simple: if a given district has more than one
connected component, pick one at random and discard the other connected
components.

Which can be visualized as follows.

<img src="./img/combine.png" style="width: 50%">
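Here's one way the `combine()` steps could look in code. It's a sketch with assumptions: solutions are 2D lists with districts labelled `1..num_districts`, `0` means open, and diagonal neighbors count as connected. Points that neither parent ends up claiming stay `0` in this sketch; how they get filled isn't covered by the steps above, so that part is left out.

```python
import random


def components(cells):
    """Split a set of (y, x) cells into connected components."""
    remaining = set(cells)
    result = []
    while remaining:
        start = remaining.pop()
        component = {start}
        stack = [start]
        while stack:
            y, x = stack.pop()
            for yi in (-1, 0, 1):
                for xi in (-1, 0, 1):
                    nxt = (y + yi, x + xi)
                    if nxt in remaining:
                        remaining.remove(nxt)
                        component.add(nxt)
                        stack.append(nxt)
        result.append(component)
    return result


def combine(parent_a, parent_b, num_districts, rng=None):
    """Build a child solution by alternating districts between parents."""
    rng = rng or random.Random()
    height, width = len(parent_a), len(parent_a[0])
    parents = [parent_a, parent_b]
    rng.shuffle(parents)
    districts = list(range(1, num_districts + 1))
    rng.shuffle(districts)
    child = [[0] * width for _ in range(height)]
    cursor = 0
    for i in districts:
        parent = parents[cursor]
        # Points that are district i in this parent AND still open.
        mask = {(y, x) for y in range(height) for x in range(width)
                if parent[y][x] == i and child[y][x] == 0}
        parts = components(mask)
        if len(parts) > 1:
            # Make the district valid: keep one component at random.
            parts.sort(key=min)  # stable order so a seeded rng is repeatable
            mask = rng.choice(parts)
        for y, x in mask:
            child[y][x] = i
        cursor = 1 - cursor  # flip to the other parent
    return child
```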

# Final Thoughts

* Both are distinctive approaches that can be applied to incredibly complex problems
* However, much of their success hinges on the effectiveness of your fitness function
* Any given "final solution" is effectively unique, or at the very least unlikely to be obtained again

# Using the Code

This is straightforward! After installing the required libraries (check [the repository](https://gitlab.com/thedataleek/politicalboundaries)) just run

```bash
$ python3.6 ./politicalboundaries.py $FILE_TO_RUN
```

If you want to dig a little deeper, use the `-h` flag to see what it can do, but
here's a short list as well.

* Use Simulated Annealing on the file
* Use the Genetic Algorithm on the file
* Set the number of districts for either solution type
* Set the precision (number of runs) for either algorithm
* Animate the solution process
* Create gifs of the solution process (otherwise just `.mp4`)
* Generate report (`README.md`) assets.
* Do all of the above in one go.

# Next Steps

I want to do more with this project, but I'm limited in the time I have. I do
have a couple of ideas for next steps, however.

* Parallelizing - Instead of just running simulations on a single thread, we could theoretically spin up a bunch of different threads and run simulations on them simultaneously, only keeping the best of all trials.
* Real Data - It would be amazing to take the approaches used in this writeup and apply them to real-world political data.
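For the parallelizing idea, the standard library already gets us most of the way. This is a sketch only: `run_trial` is a hypothetical function that takes a seed and returns one finished solution, and `fitness` ranks the results.

```python
from concurrent.futures import ThreadPoolExecutor


def best_of_parallel_trials(run_trial, fitness, seeds, max_workers=4):
    """Run one trial per seed on a thread pool and keep the best result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_trial, seeds))
    return max(results, key=fitness)
```

One caveat: in CPython, threads won't speed up CPU-bound pure-Python simulations because of the global interpreter lock. `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is the usual swap in that case.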

# Questions?