In [9]:
from cities.queries.fips_query import FipsQuery

# Similarity in Polis (conceptual overview)

This notebook explains the concepts behind the similarity functionality offered by the [Polis API](http://polis.basis.ai/similarity.html). For hands-on use of this functionality, see the accompanying notebook, `similarity_demo.ipynb`.

Similarity is computed from either a single variable or a set of variables with specified weights. For time series, older observations can be discounted. The computation is available at both the county and the Metropolitan Statistical Area (MSA) level.
  

#### Standardization

Values to be compared are standardized using the `standardize_and_scale` function in `cleaning_utils.py`. Standardization has two steps. First, a variable $X$ with mean $\mu$ and standard deviation $\sigma$ is converted to a Z-score:

$$ Z = \frac{{X - \mu}}{{\sigma}} $$

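Next, a sigmoid function is applied so that the standardized values end up in the range $[-1, 1]$. Note that the logistic function below takes values in $(0, 1)$ on its own, so reaching $[-1, 1]$ involves a further rescaling such as $2 \cdot \text{sigmoid}(x) - 1$: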

$$ \text{{sigmoid}}(x) = \frac{1}{{1 + \exp(-x)}} $$
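
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the actual `standardize_and_scale` implementation: the helper name `standardize_then_squash` and the sample values are hypothetical, and the final `2 * sigmoid - 1` rescaling is an assumption about how the logistic output, which lies in $(0, 1)$, could be stretched to $(-1, 1)$.

```python
import numpy as np


def standardize_then_squash(values):
    """Z-score the values, then pass them through the logistic sigmoid.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    # Step 1: Z-score, using the population standard deviation (ddof=0).
    z = (values - values.mean()) / values.std()
    # Step 2: logistic sigmoid, which lies strictly between 0 and 1.
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    # Assumed extra step: stretch (0, 1) to (-1, 1).
    rescaled = 2.0 * sigmoid - 1.0
    return z, sigmoid, rescaled


z, s, r = standardize_then_squash([10.0, 20.0, 30.0, 40.0, 50.0])
```

The value equal to the mean gets a Z-score of 0, a sigmoid value of 0.5, and a rescaled value of 0, so the rescaled output is centered on the average location.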








#### Distance Calculation

The comparison function calculates the differences between corresponding elements of `vector1` and `vector2` and combines them into a weighted Euclidean distance: each squared difference is multiplied by a weight before summing. The weights are determined by the level of significance assigned to each variable (ranging from 0 to 4 in the frontend, with the option of reversing the comparison to measure dissimilarity by using negative levels of significance between -1 and 1).

When a single variable is used, the level of significance matters little, since there are no other variables whose weights it could be compared against. The distance calculation is carried out by the `find_euclidean_kins` method of the `FipsQuery` class (or `MSAFipsQuery` for the MSA level), located in `fips_query.py`.

The distance formula used is as follows:

$$ d(\mathbf{v}_1, \mathbf{v}_2) = \sqrt{\sum_{i=1}^{n} w_i \cdot (v_{1i} - v_{2i})^2} $$

Here, $ n $ is the number of dimensions (features), $ w_i $ is the weight associated with dimension $ i $, and $ v_{1i} $ and $ v_{2i} $ are the corresponding values in vectors $ \mathbf{v}_1 $ and $ \mathbf{v}_2 $.

This calculation allows for the customization of distances based on the importance assigned to each feature through the weights. The function `find_euclidean_kins` then uses this distance calculation to compare the specified location to all other locations in the dataset.
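
The weighted distance formula above can be sketched as a short NumPy function. This is an illustration under stated assumptions rather than the code inside `find_euclidean_kins`: the function name `weighted_euclidean`, the candidate labels, and all numbers are hypothetical.

```python
import numpy as np


def weighted_euclidean(v1, v2, weights):
    """Square root of the weighted sum of squared differences."""
    v1, v2, weights = (np.asarray(a, dtype=float) for a in (v1, v2, weights))
    return float(np.sqrt(np.sum(weights * (v1 - v2) ** 2)))


# Hypothetical standardized feature vectors for a target and three candidates.
target = [0.2, -0.5]
candidates = {"A": [0.1, -0.4], "B": [0.9, -0.5], "C": [0.2, 0.3]}
weights = [4, 2]  # the first feature matters twice as much as the second

# Compare the target to every candidate, then sort from closest to farthest.
distances = {
    name: weighted_euclidean(target, vec, weights)
    for name, vec in candidates.items()
}
ranked = sorted(distances, key=distances.get)
```

Because the first feature carries the larger weight, candidate `B`, which differs from the target only in that feature, ends up farther away than `C`, even though `C` has the larger raw difference.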

For time series, the weights are calculated using a time decay argument. The weight assigned to a time series is divided among its available years, and time decay skews this division so that more recent years receive a higher weight. The decay is exponential, with a default coefficient of $1.08$. If the coefficient is set to $1$, all available years are assigned equal weights.
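
One way to picture the time decay is the sketch below. The exact formula used by the library is not given here, so this is an assumed form: each year's raw weight is the coefficient raised to the year's position (oldest first), and the raw weights are normalized so that they sum to the weight of the whole series. The function name `time_decay_weights` and its parameters are hypothetical.

```python
import numpy as np


def time_decay_weights(n_years, group_weight=1.0, rate=1.08):
    """Split group_weight across n_years, favoring recent years.

    Assumed form: raw weight rate**k for the k-th year (oldest first),
    normalized so the yearly weights sum to group_weight.
    """
    raw = rate ** np.arange(n_years)
    return group_weight * raw / raw.sum()


# Five years of data for a series whose overall weight is 4.
w = time_decay_weights(5, group_weight=4)

# With a coefficient of 1 the weight is split evenly across years.
w_flat = time_decay_weights(5, group_weight=4, rate=1.0)
```

Under this assumed form, each year's weight is $1.08$ times that of the year before, and the yearly weights always add up to the weight of the series.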





In the case of multiple variables, the per-variable distances are combined into a single overall distance for a county. Each per-variable distance already incorporates the weights used in its calculation, so the weights determine how much each variable contributes. Mathematically, if $d_i$ represents the distance for the $i$-th variable, the overall distance ($D$) is calculated as follows:

$$ D = \sqrt{\sum_{i=1}^{n} d_i^2} $$

The value $D$ serves as the similarity measure for a county: the smaller the distance, the more similar the county is to the specified location. On our website, these values are visualized on a map using a color gradient. Additionally, on the right-hand side, there is a ranking of the $20$ most similar counties.
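
The combination of per-variable distances and the resulting ranking can be sketched as follows. This is an illustration only: the function name `combine_distances`, the county labels, and the distance values are hypothetical, and the website's ranking code may differ.

```python
import numpy as np


def combine_distances(per_variable_distances):
    """Combine per-variable distances d_i into D = sqrt(sum_i d_i**2).

    Rows are variables, columns are candidate counties.
    """
    d = np.asarray(per_variable_distances, dtype=float)
    return np.sqrt(np.sum(d ** 2, axis=0))


# Hypothetical per-variable distances to three candidate counties.
names = ["county_a", "county_b", "county_c"]
d_population = [0.3, 0.1, 0.8]
d_gdp = [0.4, 0.2, 0.6]

D = combine_distances([d_population, d_gdp])

# Smaller D means more similar, so rank in ascending order and keep the top 2.
order = np.argsort(D)
top_k = [names[i] for i in order[:2]]
```

Here `county_b` is closest on both variables and therefore ranks first, followed by `county_a`.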

The visualization below illustrates the weights assigned to variables in the calculation of similarity for Adams County, Pennsylvania (FIPS code 42001). Both the population variable and GDP are time series. Their weights are $4$ and $2$, respectively, indicating that GDP contributes less to the overall distance measure. Additionally, the weights within each series are not uniform: more recent years are considered more relevant. This adjustment is achieved through the time decay argument described above.

In [12]:
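# Build a query for FIPS 42001, weighting population at 4 and GDP at 2.
f = FipsQuery(42001, 'gdp', feature_groups_with_weights={"population": 4, "gdp": 2})
# Compute distances to all other locations, then plot the per-year weights.
f.find_euclidean_kins()
f.plot_weights()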

The second visualization displays the $5$ locations closest to Adams County, given the variables and weights specified above.

In [13]:
fig = f.show_kins_plot()