# Encoding Races into a Vector Space
Suppose we have an arbitrarily long list of races, each represented as a string, and we'd like to embed them in a vector space. We further assume that every race in the list is fundamental, and that any non-fundamental race can
be represented as a linear combination of the fundamental ones.

In [6]:
import numpy as np

In [35]:
RACES = (
    "American Indian",
    "Asian",
    "Black",
    "Hispanic",
    "Native Hawaiian",
    "White"
)
RACES_INDICES = { race: ndx for (ndx, race) in enumerate(RACES) }
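
The tuple above is hard-coded. As a minimal sketch of how the same mapping could be built from an arbitrary list of labels, the helper below assigns each distinct label an index in first-seen order. The name `build_race_indices` and the whitespace-stripping rule are assumptions made for illustration, not part of the original notebook.

```python
def build_race_indices(labels):
    """Map each distinct label to an index, in first-seen order (hypothetical helper)."""
    indices = {}
    for label in labels:
        cleaned = label.strip()  # assumption: surrounding whitespace is not significant
        if cleaned and cleaned not in indices:
            indices[cleaned] = len(indices)
    return indices


# Duplicates and stray whitespace collapse onto a single index.
example_indices = build_race_indices(["Asian", "Black", " Asian ", "White"])
print(example_indices)  # {'Asian': 0, 'Black': 1, 'White': 2}
```

Because the indices depend on input order, the same list should be reused whenever vectors need to be compared with each other.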

## Onward
In practice, you might look for a more flexible way of creating the representation of races above, but the idea is that, given a list of races, we can assign each race a unique index seamlessly. Here, we provide a way of constructing vectors of races from a list of pairs, where each pair holds one of the individual's races and "how much" of that race they are. The "how much" values for an individual must sum to unity.

In [42]:
def createVectorForRaces(races_and_pcts):
    """
    Creates a vector of length len(RACES) whose entries record "how much" of each race an
    individual is, placed at the positions given by RACES_INDICES
    
    Args:
        races_and_pcts: A list of 2-tuples of race and proportion of said race [(race, pct_race), ...]
        The pct_race values must sum to unity!
        
    Returns:
        A vector of length len(RACES)
    """
    racesVector = np.zeros(len(RACES), dtype=np.float16) # My choice of variable name here is full of jokes I won't elaborate on :)
    
    # Ensures our constraint is maintained: the proportions must sum to 1.
    # The tolerance is deliberately loose so that rounded inputs (e.g. three shares of .33) pass.
    deviationFromUnity = abs(sum(pct for (_, pct) in races_and_pcts) - 1.0)
    assert deviationFromUnity < 1e-1
    
    for (race, pct) in races_and_pcts:
        rndx = RACES_INDICES[race] # raises KeyError for an unknown race; let's assume every race provided exists
        racesVector[rndx] = pct
    
    return racesVector

In [43]:
createVectorForRaces([
    ("White", .5),
    ("Black", .5)
])

array([0. , 0. , 0.5, 0. , 0. , 0.5], dtype=float16)
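
The function above assumes that every race it receives is known and relies on `assert` for the sum constraint. A hedged sketch of a stricter variant follows; the name `create_checked_vector`, the `race_indices` and `tol` parameters, the use of `float64`, and the three-entry `demo_indices` mapping are all assumptions for illustration rather than the notebook's own code.

```python
import numpy as np


def create_checked_vector(races_and_pcts, race_indices, tol=1e-6):
    """Build a race vector, rejecting unknown labels and bad totals (hypothetical variant)."""
    unknown = [race for race, _ in races_and_pcts if race not in race_indices]
    if unknown:
        raise ValueError(f"Unknown races: {unknown}")

    total = sum(pct for _, pct in races_and_pcts)
    if abs(total - 1.0) > tol:
        raise ValueError(f"Proportions sum to {total}, expected 1")

    vector = np.zeros(len(race_indices), dtype=np.float64)
    for race, pct in races_and_pcts:
        # Accumulate, so a label listed twice adds up instead of being overwritten.
        vector[race_indices[race]] += pct
    return vector


# Assumed, shortened index mapping used only for this demonstration.
demo_indices = {"Asian": 0, "Black": 1, "White": 2}
print(create_checked_vector([("White", 0.5), ("Black", 0.5)], demo_indices))
```

Raising `ValueError` instead of asserting keeps the checks active even when Python runs with assertions disabled.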

## What's next?
Now that we have an encoder that consistently maps race information to vectors, we can use a standard similarity measure
(the [inner product](https://en.wikipedia.org/wiki/Inner_product_space)) to see how similar two individuals are, based strictly on their race.

In [45]:
blackPerson = createVectorForRaces([("Black", 1.)])
whitePerson = createVectorForRaces([("White", 1.)])
abwPerson = createVectorForRaces([
    ("Black", .25),
    ("White", .15),
    ("Asian", .6)
])
abwPersonEven = createVectorForRaces([
    ("Black", .33),
    ("White", .33),
    ("Asian", .33)
])
print(f"How similar, ethnically speaking, is a black person to a white person? ==> {np.dot(blackPerson, whitePerson)}")
print(f"How similar, ethnically speaking, is a black person to an evenly distributed (asian, white, black) person? ==> {np.dot(blackPerson, abwPersonEven)}")
print(f"How similar, ethnically speaking, is a black person to a non-evenly distributed (asian, white, black) person? ==> {np.dot(blackPerson, abwPerson)}")

How similar, ethnically speaking, is a black person to a white person? ==> 0.0
How similar, ethnically speaking, is a black person to an evenly distributed (asian, white, black) person? ==> 0.330078125
How similar, ethnically speaking, is a black person to a non-evenly distributed (asian, white, black) person? ==> 0.25
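
One property of the raw dot product is worth noting: because the entries of a mixed vector sum to 1 rather than its length being 1, a mixed individual's similarity to themselves is below 1. A common alternative is cosine similarity, which divides by the two vector norms. The sketch below is one possible way to do that; the function name `cosine_similarity` and the three-dimensional example vectors (ordered Asian, Black, White) are assumptions for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their Euclidean norms."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / norm_product)


# Assumed component order for this small example: (Asian, Black, White).
black = [0.0, 1.0, 0.0]
white = [0.0, 0.0, 1.0]
even_mix = [1 / 3, 1 / 3, 1 / 3]

print(np.dot(even_mix, even_mix))             # raw self-similarity, roughly 1/3
print(cosine_similarity(even_mix, even_mix))  # normalized self-similarity, roughly 1
print(cosine_similarity(black, white))        # disjoint races still score 0.0
```

Which of the two measures is appropriate depends on whether the magnitude of the vector should influence the score.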


## What does this mean?
Further analysis could be done, but the results above show that someone of mixed race can have varying degrees of similarity to
someone of a single race. This follows directly from the construction: the inner product of a single-race vector with any other vector picks out
that vector's component for the race in question, so the score is simply the mixed individual's proportion of it. In reading the numbers this way we are imbuing
numerical results with sociological meaning. Whether such meaning makes sense depends on the application and its
results, but the above serves as a minimal example of encoding races into a vector space.
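
As a worked check of the numbers printed earlier, write $e_k$ for the vector of a person who is entirely of race $k$ (a 1 in position $k$, zeros elsewhere) and $v$ for any race vector. The notation $e_k$ and $v$ is introduced here for illustration only.

$$
\langle e_k, v \rangle = \sum_{i} (e_k)_i \, v_i = v_k
$$

With the index order from `RACES`, the non-evenly distributed individual is $v = (0,\ 0.6,\ 0.25,\ 0,\ 0,\ 0.15)$, so $\langle e_{\text{Black}}, v \rangle = 0.25$, matching the printed result. For the evenly distributed individual the same formula gives $0.33$; the printed value $0.330078125$ is the nearest `float16` number to $0.33$, since the vectors are stored with `dtype=np.float16`.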