In [1]:
# set up for continuous integration
# can be safely ignored by the user
import os

smoke_test = "CI" in os.environ
sample_size = 10 if smoke_test else 100

from cities.queries.fips_query import FipsQuery
# proper imports
from cities.utils.data_grabber import (DataGrabber, list_available_features,
                                       list_interventions, list_outcomes,
                                       list_tensed_features)

#### How to use `FipsQuery` to ask similarity questions


The code chunk below illustrates a basic use of the `FipsQuery` class for comparing counties in terms of specified parameters and features.
The simplest use identifies a county by its `fips` code and, optionally, names an outcome variable as the key variable of interest. Without running any similarity comparison, we can then compare the county's recent performance to a random sample of other US counties. `range_multiplier` only regulates the range of the `y` axis.

In [2]:
f = FipsQuery(42001, "gdp")
f.compare_my_outcome_to_others(sample_size=sample_size, range_multiplier=10)

We can also assign weights to individual variables (such as `gdp` or `population`) or to sets of variables (such as `ethnic_composition` or `urbanization`). Feature weights range from $-4$ to $4$ and control how much each feature (or feature group) influences the final similarity score; a negative weight means we are interested in dissimilarity in that respect. Once we instantiate a query, we run the calculations and inspect the results.

The plot shows the weights assigned to the specified variables. Notice that, for time series, the weights are not uniform across years: more recent years intentionally receive higher weights. The decay is exponential and can be adjusted (if set to `1`, all available years receive equal weight).
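
To make the effect of the decay parameter concrete, here is a minimal sketch of exponentially decaying year weights. The helper `time_decay_weights` and its formula (raw weight `time_decay ** (year - first_year)`, then normalized to sum to one) are assumptions made for illustration; `FipsQuery` may compute its weights differently.

```python
def time_decay_weights(years, time_decay):
    """Hypothetical helper: raw weights grow by a factor of `time_decay` per year."""
    first_year = min(years)
    # raw weight of a year is time_decay raised to its offset from the first year
    raw = {year: time_decay ** (year - first_year) for year in years}
    total = sum(raw.values())
    # normalize so that the weights sum to 1
    return {year: weight / total for year, weight in raw.items()}


years = list(range(2017, 2022))
equal_weights = time_decay_weights(years, time_decay=1.0)  # all years weigh the same
recent_heavy = time_decay_weights(years, time_decay=1.5)  # recent years dominate

for year in years:
    print(year, round(equal_weights[year], 3), round(recent_heavy[year], 3))
```

Under this assumed formula, a decay of `1.5` over five years already makes the last year about five times as heavy as the first, which is why plotting the weights is a useful sanity check.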

In [3]:
f = FipsQuery(
    42001,
    "gdp",
    feature_groups_with_weights={"gdp": 1, "population": 2, "ethnic_composition": 3},
)

f.find_euclidean_kins()
f.plot_weights()
display(f.euclidean_kins.head())

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,cuban_ethnic_composition,other_hispanic_latino_ethnic_composition,white_ethnic_composition,black_african_american_ethnic_composition,american_indian_alaska_native_ethnic_composition,asian_ethnic_composition,native_hawaiian_other_pacific_islander_ethnic_composition,other_race_races_ethnic_composition,distance to 42001,percentile
0,42001,"Adams, PA",78.619,84.689,84.475,85.86,89.556,93.508,93.154,95.379,...,0.001282,0.007294,0.441782,0.006836,0.000361,0.003772,2.4e-05,0.010845,0.0,35.04
1,55111,"Sauk, WI",82.875,84.592,88.181,95.186,97.049,95.672,94.922,93.953,...,0.00016,0.004616,0.448913,0.004004,0.005724,0.002415,0.0,0.010867,0.08931,50.75
2,29213,"Taney, MO",83.682,84.756,85.324,83.514,88.412,89.617,95.733,95.833,...,0.000116,0.007869,0.439593,0.008782,0.002515,0.003151,4.5e-05,0.01443,0.090979,43.43
3,36077,"Otsego, NY",84.619,85.798,89.384,94.284,95.18,97.756,94.764,98.616,...,0.000619,0.007973,0.451995,0.008787,0.000729,0.006675,0.000534,0.011993,0.092976,43.2
4,55131,"Washington, WI",76.23,78.707,82.806,85.619,89.292,91.108,93.105,92.335,...,0.000404,0.002373,0.460007,0.006636,0.000756,0.00708,4e-05,0.008485,0.093832,58.62


You can visualize the top `n` best matches by calling `show_kins_plot()`. `n` is specified when you instantiate a query; its default value is 5. An example with a different `n` will be shown later.

In [4]:
fig = f.show_kins_plot()

When we instantiate a `FipsQuery`, we can specify more parameters:

- `lag` - time lag for comparing outcomes with historical data (which places, `lag` years in the past, were most similar to the specified location as it is now?)
- `top` - the number of top locations to consider in the comparisons
- `time_decay` - adjusts the weight decay over time in the generalized Euclidean distance calculation. If set to `1`, all years are given equal weight. Values above one discount the importance of older years, while values below one increase the importance of older years. Warning: the decay is exponential, so small parameter changes may have large effects. Always plot the weights as a sanity check!
- `outcome_comparison_period` - specifies the years to consider for the outcome comparison (can be used only when `lag=0`)
- `outcome_percentile_range` - defines a percentile range for filtering locations based on the most recent value of the outcome variable
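
As an illustration of the last parameter, the following sketch filters locations by the percentile rank of their most recent outcome value. The helpers `percentile_ranks` and `filter_by_percentile`, the rank definition (share of locations with a value less than or equal to the given one), and all data are hypothetical; the library's own percentile computation may differ.

```python
def percentile_ranks(latest_outcomes):
    """Hypothetical helper: percent of locations whose latest outcome is <= this one."""
    values = list(latest_outcomes.values())
    n = len(values)
    return {
        code: 100.0 * sum(other <= value for other in values) / n
        for code, value in latest_outcomes.items()
    }


def filter_by_percentile(latest_outcomes, percentile_range):
    """Keep only locations whose rank falls inside the (low, high) range."""
    low, high = percentile_range
    ranks = percentile_ranks(latest_outcomes)
    return [code for code, rank in ranks.items() if low <= rank <= high]


# made-up latest outcome values keyed by made-up location codes
latest = {"A": 90.0, "B": 95.0, "C": 100.0, "D": 105.0, "E": 110.0}
kept = filter_by_percentile(latest, (40, 100))
print(kept)
```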


We can also investigate the results of `find_euclidean_kins()` by displaying: a ranking of locations by their similarity to the specified location (`f.euclidean_kins`), the weighted contribution of each feature to the comparison (`f.featurewise_contributions`), and the feature contributions aggregated over feature groups and normalized (`f.aggregated_featurewise_contributions`).
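
The aggregation step can be pictured as follows. This is only a sketch under assumptions: the helper `aggregate_contributions`, the column names, and the numbers are made up, and it simply sums per-column contributions within each group and rescales them so that the groups' shares sum to one.

```python
def aggregate_contributions(featurewise, groups):
    """Hypothetical helper: sum contributions per group, then rescale to shares."""
    totals = {
        group: sum(featurewise[column] for column in columns)
        for group, columns in groups.items()
    }
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}


# made-up per-column contributions to the distance for one location
featurewise = {
    "2020_gdp": 0.06,
    "2021_gdp": 0.03,
    "2020_population": 0.004,
    "2021_population": 0.006,
}
groups = {
    "gdp": ["2020_gdp", "2021_gdp"],
    "population": ["2020_population", "2021_population"],
}

shares = aggregate_contributions(featurewise, groups)
print(shares)
```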

In [5]:
f = FipsQuery(
    42001,
    "gdp",
    feature_groups_with_weights={"gdp": 1, "population": 2, "ethnic_composition": 3},
    lag=0,
    top=5,
    time_decay=1.06,
    outcome_comparison_period=(2003, 2019),
    outcome_percentile_range=(40, 100),
)
f.find_euclidean_kins()

# you just want the resulting ranking with the original features
display(f.euclidean_kins.head())

# you want to inspect weighted contributions of each feature
display(f.featurewise_contributions.head())

# you want to aggregate these across feature groups and normalize
# numbers now mean: "fraction of the contribution to the distance"
display(f.aggregated_featurewise_contributions.head())

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,cuban_ethnic_composition,other_hispanic_latino_ethnic_composition,white_ethnic_composition,black_african_american_ethnic_composition,american_indian_alaska_native_ethnic_composition,asian_ethnic_composition,native_hawaiian_other_pacific_islander_ethnic_composition,other_race_races_ethnic_composition,distance to 42001,percentile
0,42001,"Adams, PA",78.619,84.689,84.475,85.86,89.556,93.508,93.154,95.379,...,0.001282,0.007294,0.441782,0.006836,0.000361,0.003772,2.4e-05,0.010845,0.0,35.04
1,55131,"Washington, WI",76.23,78.707,82.806,85.619,89.292,91.108,93.105,92.335,...,0.000404,0.002373,0.460007,0.006636,0.000756,0.00708,4e-05,0.008485,0.090042,58.62
2,55111,"Sauk, WI",82.875,84.592,88.181,95.186,97.049,95.672,94.922,93.953,...,0.00016,0.004616,0.448913,0.004004,0.005724,0.002415,0.0,0.010867,0.091508,50.75
3,29213,"Taney, MO",83.682,84.756,85.324,83.514,88.412,89.617,95.733,95.833,...,0.000116,0.007869,0.439593,0.008782,0.002515,0.003151,4.5e-05,0.01443,0.094434,43.43
4,37089,"Henderson, NC",88.835,92.642,91.296,93.431,96.665,102.28,97.338,97.962,...,0.001111,0.011703,0.41235,0.016244,0.000917,0.006734,0.0,0.011469,0.095611,57.19


Unnamed: 0,GeoFIPS,GeoName,2003_gdp,2004_gdp,2005_gdp,2006_gdp,2007_gdp,2008_gdp,2009_gdp,2010_gdp,...,puerto_rican_ethnic_composition,cuban_ethnic_composition,other_hispanic_latino_ethnic_composition,white_ethnic_composition,black_african_american_ethnic_composition,american_indian_alaska_native_ethnic_composition,asian_ethnic_composition,native_hawaiian_other_pacific_islander_ethnic_composition,other_race_races_ethnic_composition,distance to 42001
3044,55129.0,"Washburn, WI",0.005748,0.000875,0.00099,0.009389,0.000217,0.015559,0.039459,0.017005,...,0.047921,0.019464,0.025067,0.016246,0.000256,0.000941,0.021832,0.000796,0.022858,0.090042
3034,55109.0,"St. Croix, WI",0.012786,0.03393,0.028158,0.008474,0.007844,0.007293,0.016927,0.000185,...,0.05466,0.024848,0.013656,0.006371,0.003617,0.012802,0.008946,0.001179,0.000214,0.091508
1571,29213.0,"Taney, MO",0.002927,0.008507,0.004291,0.015215,0.011443,0.002323,0.018741,0.024738,...,0.041995,0.025825,0.002932,0.001959,0.002487,0.005141,0.004096,0.001012,0.034946,0.094434
1917,37089.0,"Henderson, NC",0.023543,0.027542,0.026714,0.034339,0.018569,0.013219,0.022718,0.018446,...,0.027551,0.00377,0.022499,0.026442,0.012041,0.001325,0.019545,0.001179,0.006073,0.095611
2245,42037.0,"Columbia, PA",0.011598,0.0154,0.00938,0.000568,0.011217,0.016991,0.008262,0.014254,...,0.053297,0.022504,0.029068,0.025933,0.000988,0.000297,0.006143,0.002877,0.002435,0.095854


Unnamed: 0,GeoFIPS,GeoName,distance to 42001,gdp,population,ethnic_composition
3044,55129.0,"Washburn, WI",0.090042,0.337936,0.267116,0.394948
3034,55109.0,"St. Croix, WI",0.091508,0.386295,0.331599,0.282106
1571,29213.0,"Taney, MO",0.094434,0.382932,0.381597,0.23547
1917,37089.0,"Henderson, NC",0.095611,0.630087,0.053231,0.316683
2245,42037.0,"Columbia, PA",0.095854,0.490954,0.104616,0.40443


## Use Cases


#### Use case: no outcome, just features

This scenario shows a query with no outcome variable passed. The plot displays the weights assigned to the variables in the given years. The weights are not uniform across years, so that patterns from the most recent period are emphasized.

In [6]:
# You don't want to pass outcome and are interested in similarities
# but you also want to see what features are available

# you may list available features
print(list_available_features())

# you may notice that `industry` is available, as well as particular `industry_` features. Let's illustrate why:
print(list_tensed_features())

# `industry_` are particular time series, whereas `industry` does not occur on the tensed_features list. You can inspect
# the head of the `industry` dataset:

dg = DataGrabber()

# each dataset comes in four flavors: raw/normalized and wide/long. Let's load and inspect the raw wide version for `industry`:

dg.get_features_wide(["industry"])
display(dg.wide["industry"].head())  # display explicitly: not the last expression in the cell

# these are tenseless data, whose columns are explained in the `data_sources.ipynb` notebook.

# you can also look at other lists

print(list_interventions())
print(list_outcomes())  # these will be time series without the interventions

# let's move on with our example.

['ethnic_composition', 'gdp', 'industry', 'industry_accommodation_food_services_total', 'industry_admin_support_services_total', 'industry_agriculture_total', 'industry_arts_recreation_total', 'industry_construction_total', 'industry_educational_services_total', 'industry_finance_insurance_total', 'industry_healthcare_social_services_total', 'industry_information_total', 'industry_management_enterprises_total', 'industry_manufacturing_total', 'industry_mining_total', 'industry_other_services_total', 'industry_professional_services_total', 'industry_public_administration_total', 'industry_real_estate_total', 'industry_retail_trade_total', 'industry_transportation_warehousing_total', 'industry_utilities_total', 'industry_wholesale_trade_total', 'medianHouseholdIncome', 'population', 'povertyAll', 'povertyAllprct', 'povertyUnder18', 'povertyUnder18prct', 'spending_HHS', 'spending_commerce', 'spending_transportation', 'transport', 'unemployment_rate', 'urbanization']
['gdp', 'industry_acco

In [7]:
f = FipsQuery(42001, feature_groups_with_weights={"population": 4, "spending_HHS": 3})
f.find_euclidean_kins()
f.plot_weights()
display(f.euclidean_kins)

Unnamed: 0,GeoFIPS,GeoName,1993_population,1994_population,1995_population,1996_population,1997_population,1998_population,1999_population,2000_population,...,2013_spending_HHS,2014_spending_HHS,2015_spending_HHS,2016_spending_HHS,2017_spending_HHS,2018_spending_HHS,2019_spending_HHS,2020_spending_HHS,2021_spending_HHS,distance to 42001
2227,42001,"Adams, PA",83013.0,84186.0,85063.0,86252.0,87751.0,89074.0,90363.0,91457.0,...,6.841459e+07,2.843109e+07,1.022756e+07,3.068268e+07,2.081153e+07,3.134013e+07,5.207267e+07,3.151378e+07,4.473508e+07,0.000000
1501,29071,"Franklin, MO",84234.0,85586.0,87584.0,89357.0,90710.0,91697.0,92914.0,94050.0,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,7.500000e+05,5.000000e+05,9.500000e+05,9.000000e+05,0.001296
3043,55127,"Walworth, WI",80707.0,83079.0,84933.0,86893.0,88862.0,90724.0,92407.0,92328.0,...,0.000000e+00,0.000000e+00,0.000000e+00,2.376715e+07,5.118558e+07,6.124352e+07,5.733653e+07,5.592463e+07,4.706138e+07,0.001933
207,6057,"Nevada, CA",83887.0,85185.0,86523.0,87861.0,89137.0,90170.0,90899.0,92520.0,...,3.942539e+06,3.645560e+06,4.258991e+07,4.172342e+07,1.388350e+08,2.973797e+08,2.465198e+08,1.116041e+08,2.074995e+08,0.002467
2996,55035,"Eau Claire, WI",88289.0,89072.0,89868.0,90694.0,91487.0,91870.0,92618.0,93310.0,...,3.123919e+06,1.503982e+06,2.112040e+06,2.560289e+06,1.814911e+06,1.595182e+06,5.516769e+05,5.298087e+06,8.091123e+06,0.002480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2732,48453,"Travis, TX",649226.0,671759.0,696278.0,717194.0,736587.0,761335.0,788500.0,819692.0,...,2.225282e+11,2.231797e+11,2.118196e+11,2.138400e+11,2.527288e+11,3.552319e+11,4.896949e+11,6.288495e+11,7.776532e+11,1.973836
600,17031,"Cook, IL",5235344.0,5262162.0,5283463.0,5306511.0,5322117.0,5345537.0,5365344.0,5373418.0,...,1.945389e+10,1.879525e+10,1.811116e+10,2.067402e+10,2.868427e+10,3.211947e+10,4.050188e+10,4.095196e+10,4.722273e+10,2.016573
212,6067,"Sacramento, CA",1127608.0,1130094.0,1140825.0,1155635.0,1169855.0,1186617.0,1206659.0,1229940.0,...,3.614194e+11,4.741250e+11,3.639794e+11,3.878961e+11,3.388018e+11,1.298456e+12,1.235765e+12,1.717218e+12,1.842294e+12,2.131305
197,6037,"Los Angeles, CA",9100159.0,9096608.0,9089015.0,9127042.0,9206538.0,9313589.0,9437290.0,9538191.0,...,2.348763e+10,2.294435e+10,2.502567e+10,2.839330e+10,3.868660e+10,5.899629e+10,1.410259e+11,2.187211e+11,2.577860e+11,2.164873


#### Use case: outcome with weight 0

In this example, we assign a weight of 0 to the outcome. This effectively excludes the outcome from the similarity calculations while still allowing it to be plotted.
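
Why a zero weight removes a variable can be seen from a plain weighted Euclidean distance, $\sqrt{\sum_i w_i (a_i - b_i)^2}$. The sketch below uses made-up, already scaled values and a hypothetical helper `weighted_euclidean`; how `FipsQuery` scales features and applies weights internally is not shown here.

```python
import math


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance for non-negative weights."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# two made-up locations described by (outcome, population) on a comparable scale
me = (0.2, 0.5)
other = (0.9, 0.6)

with_outcome = weighted_euclidean(me, other, weights=(1, 4))
without_outcome = weighted_euclidean(me, other, weights=(0, 4))

# with weight 0, changing the outcome value no longer affects the distance
shifted = weighted_euclidean((5.0, 0.5), other, weights=(0, 4))
print(with_outcome, without_outcome, shifted)
```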

In [8]:
# you want to pass an outcome but give it weight 0 in similarity calculations

print(list_available_features())
f = FipsQuery(
    42001,
    outcome_var="spending_HHS",
    feature_groups_with_weights={"spending_HHS": 0, "population": 4},
)
f.find_euclidean_kins()
f.plot_weights()
display(f.euclidean_kins)
f.plot_kins()

['ethnic_composition', 'gdp', 'industry', 'industry_accommodation_food_services_total', 'industry_admin_support_services_total', 'industry_agriculture_total', 'industry_arts_recreation_total', 'industry_construction_total', 'industry_educational_services_total', 'industry_finance_insurance_total', 'industry_healthcare_social_services_total', 'industry_information_total', 'industry_management_enterprises_total', 'industry_manufacturing_total', 'industry_mining_total', 'industry_other_services_total', 'industry_professional_services_total', 'industry_public_administration_total', 'industry_real_estate_total', 'industry_retail_trade_total', 'industry_transportation_warehousing_total', 'industry_utilities_total', 'industry_wholesale_trade_total', 'medianHouseholdIncome', 'population', 'povertyAll', 'povertyAllprct', 'povertyUnder18', 'povertyUnder18prct', 'spending_HHS', 'spending_commerce', 'spending_transportation', 'transport', 'unemployment_rate', 'urbanization']


Unnamed: 0,GeoFIPS,GeoName,2010,2011,2012,2013,2014,2015,2016,2017,...,2014_population,2015_population,2016_population,2017_population,2018_population,2019_population,2020_population,2021_population,distance to 42001,percentile
0,42001,"Adams, PA",2.771827e+07,2.855134e+07,1.427164e+07,6.841459e+07,2.843109e+07,1.022756e+07,3.068268e+07,2.081153e+07,...,101830.0,102411.0,102625.0,103414.0,103932.0,103778.0,103795.0,104127.0,0.000000,63.83
1,29071,"Franklin, MO",0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,102058.0,102429.0,102952.0,103563.0,103967.0,104137.0,104769.0,105231.0,0.001140,44.83
2,55127,"Walworth, WI",0.000000e+00,1.003899e+06,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,2.376715e+07,5.118558e+07,...,104251.0,103812.0,104223.0,104698.0,105618.0,106228.0,106499.0,106799.0,0.001850,64.31
3,6057,"Nevada, CA",2.009387e+06,3.769145e+06,3.358054e+06,3.942539e+06,3.645560e+06,4.258991e+07,4.172342e+07,1.388350e+08,...,99649.0,100009.0,100485.0,101226.0,101530.0,101962.0,102199.0,103487.0,0.002104,81.78
4,55035,"Eau Claire, WI",1.946690e+06,4.980976e+06,3.021742e+06,3.123919e+06,1.503982e+06,2.112040e+06,2.560289e+06,1.814911e+06,...,101989.0,102445.0,103351.0,104091.0,104755.0,105002.0,105818.0,106452.0,0.002412,53.19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3069,6073,"San Diego, CA",1.101538e+10,1.497979e+10,1.292272e+10,1.524681e+10,1.384056e+10,1.543447e+10,1.729092e+10,2.192470e+10,...,3234658.0,3262566.0,3283586.0,3293575.0,3303463.0,3297959.0,3297252.0,3286069.0,1.833863,98.05
3070,4013,"Maricopa, AZ",5.202028e+10,6.495048e+10,5.861352e+10,6.047153e+10,6.486485e+10,6.829313e+10,7.820898e+10,1.406561e+11,...,4040171.0,4105747.0,4174844.0,4231511.0,4292576.0,4363816.0,4438342.0,4496588.0,1.918351,99.71
3071,48201,"Harris, TX",9.123876e+09,1.167109e+10,9.843632e+09,9.617106e+09,1.003390e+10,1.052874e+10,1.347828e+10,1.688323e+10,...,4452976.0,4553991.0,4619635.0,4651955.0,4672445.0,4704042.0,4732491.0,4728030.0,1.942420,98.08
3072,17031,"Cook, IL",1.461862e+10,1.965479e+10,1.758947e+10,1.945389e+10,1.879525e+10,1.811116e+10,2.067402e+10,2.868427e+10,...,5320233.0,5324961.0,5320293.0,5311621.0,5297956.0,5287099.0,5262741.0,5173146.0,1.980426,98.50


#### Use case: negative weight

This example illustrates the use of negative weights, which express interest in dissimilarity from the county of interest with respect to the given feature.

In [9]:
# the other queries still work

f = FipsQuery(
    1007,
    outcome_var="gdp",
    feature_groups_with_weights={
        "gdp": -2,
        "population": 1,
    },  # negative weight on gdp, positive weight on population
    lag=0,
    top=5,
    time_decay=1.03,
)
f.find_euclidean_kins()
f.plot_weights()
display(f.euclidean_kins)
f.plot_kins()

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,2014_population,2015_population,2016_population,2017_population,2018_population,2019_population,2020_population,2021_population,distance to 1007,percentile
0,1007,"Bibb, AL",80.443,81.527,85.124,89.317,88.782,89.597,95.308,94.745,...,22586.0,22607.0,22654.0,22606.0,22383.0,22405.0,22223.0,22477.0,0.000000,47.76
1,48109,"Culberson, TX",35.264,37.743,36.255,38.339,40.177,41.247,42.368,53.349,...,2301.0,2275.0,2244.0,2259.0,2212.0,2186.0,2193.0,2193.0,1.898948,99.97
2,48389,"Reeves, TX",46.003,49.290,44.960,41.682,39.742,41.332,41.009,41.389,...,14614.0,14936.0,14484.0,14314.0,14526.0,14847.0,14730.0,14487.0,1.920264,99.90
3,31005,"Arthur, NE",126.123,151.196,151.543,154.669,168.961,203.866,119.700,113.293,...,437.0,433.0,445.0,432.0,435.0,436.0,431.0,439.0,1.926074,99.58
4,54017,"Doddridge, WV",45.487,45.897,45.933,46.619,49.647,54.491,56.730,55.200,...,8223.0,8392.0,8210.0,8108.0,8087.0,7922.0,7786.0,7735.0,1.950935,99.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3069,6059,"Orange, CA",82.036,84.241,89.790,94.090,99.072,102.799,102.301,99.220,...,3133194.0,3157254.0,3174233.0,3185766.0,3188828.0,3185685.0,3184101.0,3167809.0,2.910696,76.15
3070,4013,"Maricopa, AZ",77.463,80.415,85.793,90.349,97.615,102.729,105.313,103.632,...,4040171.0,4105747.0,4174844.0,4231511.0,4292576.0,4363816.0,4438342.0,4496588.0,2.913534,87.18
3071,48201,"Harris, TX",73.137,72.696,72.847,79.658,80.626,87.278,94.310,91.961,...,4452976.0,4553991.0,4619635.0,4651955.0,4672445.0,4704042.0,4732491.0,4728030.0,2.934048,51.04
3072,17031,"Cook, IL",95.406,94.886,95.455,97.260,99.315,101.320,101.826,99.238,...,5320233.0,5324961.0,5320293.0,5311621.0,5297956.0,5287099.0,5262741.0,5173146.0,2.943509,52.34


#### Use case: similarity in outcome patterns

You want to find the ten jurisdictions with the most similar `gdp` time series patterns. You value recent years a bit more, and you only want the years 2003-2019 to be used for the outcome comparison.

In [10]:
f = FipsQuery(
    42001,
    "gdp",
    lag=0,
    top=10,
    time_decay=1.06,
    outcome_comparison_period=(2003, 2019),
    outcome_percentile_range=(40, 100),
)
f.find_euclidean_kins()
f.plot_weights()

In [11]:
# you can find the distances and inspect the resulting
# dataframe that contains the ranking:
f.find_euclidean_kins()
display(f.euclidean_kins)

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,2014,2015,2016,2017,2018,2019,2020,2021,distance to 42001,percentile
0,42001,"Adams, PA",78.619,84.689,84.475,85.860,89.556,93.508,93.154,95.379,...,100.006,102.509,103.708,108.411,105.390,103.440,97.678,102.664,0.000000,35.04
1,24013,"Carroll, MD",76.152,80.700,82.853,86.013,89.398,95.381,95.463,97.823,...,99.510,101.215,101.568,106.456,104.838,105.452,101.050,105.298,0.028407,41.83
3,36029,"Erie, NY",85.085,86.619,87.762,90.526,90.342,92.417,93.693,95.281,...,101.917,104.175,104.843,104.038,105.414,107.910,104.293,109.868,0.033851,52.37
4,20037,"Crawford, KS",86.941,89.736,87.067,87.222,89.467,91.379,92.292,96.241,...,99.191,98.333,99.819,104.970,109.931,110.806,109.826,112.758,0.036416,58.98
6,13153,"Houston, GA",78.281,81.594,85.417,87.998,92.355,94.139,96.047,95.277,...,97.624,99.403,100.100,103.078,104.657,108.421,106.848,110.816,0.036827,54.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3069,48389,"Reeves, TX",46.003,49.290,44.960,41.682,39.742,41.332,41.009,41.389,...,212.473,334.183,416.768,607.235,843.967,1226.531,1136.483,942.206,1.468391,99.90
3070,54017,"Doddridge, WV",45.487,45.897,45.933,46.619,49.647,54.491,56.730,55.200,...,350.799,570.504,559.353,592.505,518.304,552.703,663.164,441.052,1.492258,99.84
3071,48233,"Hutchinson, TX",109.729,102.552,96.027,105.075,92.319,97.763,108.460,130.444,...,600.611,471.316,743.692,529.449,476.537,449.404,389.278,364.625,1.505063,99.74
3072,31005,"Arthur, NE",126.123,151.196,151.543,154.669,168.961,203.866,119.700,113.293,...,347.424,470.350,397.270,383.239,327.485,342.512,332.015,295.620,1.505257,99.58


In [12]:
# you can plot the few most similar locations:
fig = f.show_kins_plot()

#### Use case: similarity in outcome patterns and in some other features

Say you want to include historical population patterns in your similarity ranking. You also want to pay a bit more attention to older data points than in the previous example, so you lower `time_decay` to `1.03`. You can also set a weight to a negative value to indicate that you care about dissimilarity in that feature.

In [13]:
f = FipsQuery(
    1007,
    outcome_var="gdp",
    feature_groups_with_weights={
        "gdp": -2,
        "population": 1,
    },  # gdp weighted negatively (dissimilarity), population positively
    lag=0,
    top=5,
    time_decay=1.03,
)
f.find_euclidean_kins()
# you can still inspect the resulting weighting:
f.plot_weights()

In [14]:
# you still have access to the distances and the ranking
display(f.euclidean_kins)

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,2014_population,2015_population,2016_population,2017_population,2018_population,2019_population,2020_population,2021_population,distance to 1007,percentile
0,1007,"Bibb, AL",80.443,81.527,85.124,89.317,88.782,89.597,95.308,94.745,...,22586.0,22607.0,22654.0,22606.0,22383.0,22405.0,22223.0,22477.0,0.000000,47.76
1,48109,"Culberson, TX",35.264,37.743,36.255,38.339,40.177,41.247,42.368,53.349,...,2301.0,2275.0,2244.0,2259.0,2212.0,2186.0,2193.0,2193.0,1.898948,99.97
2,48389,"Reeves, TX",46.003,49.290,44.960,41.682,39.742,41.332,41.009,41.389,...,14614.0,14936.0,14484.0,14314.0,14526.0,14847.0,14730.0,14487.0,1.920264,99.90
3,31005,"Arthur, NE",126.123,151.196,151.543,154.669,168.961,203.866,119.700,113.293,...,437.0,433.0,445.0,432.0,435.0,436.0,431.0,439.0,1.926074,99.58
4,54017,"Doddridge, WV",45.487,45.897,45.933,46.619,49.647,54.491,56.730,55.200,...,8223.0,8392.0,8210.0,8108.0,8087.0,7922.0,7786.0,7735.0,1.950935,99.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3069,6059,"Orange, CA",82.036,84.241,89.790,94.090,99.072,102.799,102.301,99.220,...,3133194.0,3157254.0,3174233.0,3185766.0,3188828.0,3185685.0,3184101.0,3167809.0,2.910696,76.15
3070,4013,"Maricopa, AZ",77.463,80.415,85.793,90.349,97.615,102.729,105.313,103.632,...,4040171.0,4105747.0,4174844.0,4231511.0,4292576.0,4363816.0,4438342.0,4496588.0,2.913534,87.18
3071,48201,"Harris, TX",73.137,72.696,72.847,79.658,80.626,87.278,94.310,91.961,...,4452976.0,4553991.0,4619635.0,4651955.0,4672445.0,4704042.0,4732491.0,4728030.0,2.934048,51.04
3072,17031,"Cook, IL",95.406,94.886,95.455,97.260,99.315,101.320,101.826,99.238,...,5320233.0,5324961.0,5320293.0,5311621.0,5297956.0,5287099.0,5262741.0,5173146.0,2.943509,52.34


In [15]:
# you can still plot the few top-ranked locations:
fig = f.show_kins_plot()

#### Use case: similarity of outcome with a lag

You care about similarity of outcome variables, but your question now is: which other locations were, 2 years ago, in a situation similar to yours now with respect to the outcome variable and the features?
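
The alignment implied by a lag can be sketched as below. The helper `align_with_lag`, the `window` argument, and the series are hypothetical and only show which years get paired; they are not the library's internal implementation.

```python
def align_with_lag(mine, theirs, lag, window):
    """Hypothetical helper: pair my last `window` years with their years `lag` years earlier."""
    last_year = max(mine)
    aligned = []
    for my_year in range(last_year - window + 1, last_year + 1):
        their_year = my_year - lag
        aligned.append((my_year, their_year, mine[my_year], theirs[their_year]))
    return aligned


# made-up outcome series keyed by year; `theirs` runs two years "ahead" of `mine`
mine = {year: 100.0 + (year - 2015) for year in range(2015, 2022)}
theirs = {year: 102.0 + (year - 2015) for year in range(2015, 2022)}

aligned = align_with_lag(mine, theirs, lag=2, window=3)
for my_year, their_year, my_value, their_value in aligned:
    print(f"my {my_year} ({my_value}) vs their {their_year} ({their_value})")
```

With `lag=2`, my year 2021 is compared with their year 2019, and in this made-up data the two series match exactly once shifted.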


In [16]:
f = FipsQuery(42001, "gdp", lag=2, top=5, time_decay=1.06)
f.find_euclidean_kins()

f.plot_weights()

In [17]:
f.find_euclidean_kins()
f.euclidean_kins

Unnamed: 0,GeoFIPS,GeoName,2001,2002,2003,2004,2005,2006,2007,2008,...,2014,2015,2016,2017,2018,2019,2020,2021,distance to 42001,percentile
0,42001,"Adams, PA",78.619,84.689,84.475,85.860,89.556,93.508,93.154,95.379,...,100.006,102.509,103.708,108.411,105.390,103.440,97.678,102.664,0.000000,35.04
1,17097,"Lake, IL",78.656,81.216,83.875,87.882,89.657,94.695,97.707,94.213,...,101.651,104.679,102.943,102.954,105.339,105.335,101.592,106.468,0.031418,44.24
2,17001,"Adams, IL",79.654,81.654,86.491,90.160,91.589,93.582,93.367,95.001,...,102.400,101.760,104.586,97.057,99.635,98.477,93.510,100.614,0.034746,30.51
3,55127,"Walworth, WI",83.081,82.847,86.725,91.131,94.582,95.464,95.346,95.036,...,99.368,101.910,100.466,101.012,102.907,103.969,102.260,108.502,0.039798,49.54
4,25027,"Worcester, MA",83.429,84.954,88.210,89.841,90.625,92.234,93.830,96.316,...,101.656,103.801,104.461,104.997,107.212,107.542,104.070,109.597,0.042352,51.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3069,48389,"Reeves, TX",46.003,49.290,44.960,41.682,39.742,41.332,41.009,41.389,...,212.473,334.183,416.768,607.235,843.967,1226.531,1136.483,942.206,1.429865,99.90
3070,54017,"Doddridge, WV",45.487,45.897,45.933,46.619,49.647,54.491,56.730,55.200,...,350.799,570.504,559.353,592.505,518.304,552.703,663.164,441.052,1.450591,99.84
3071,48233,"Hutchinson, TX",109.729,102.552,96.027,105.075,92.319,97.763,108.460,130.444,...,600.611,471.316,743.692,529.449,476.537,449.404,389.278,364.625,1.463468,99.74
3072,31005,"Arthur, NE",126.123,151.196,151.543,154.669,168.961,203.866,119.700,113.293,...,347.424,470.350,397.270,383.239,327.485,342.512,332.015,295.620,1.469543,99.58


In [18]:
# notice the shift: their year 2019 is aligned with your year 2021!

fig = f.show_kins_plot()

#### Use case: similarity with respect to population (but not outcome), with a lag

In [19]:
f = FipsQuery(
    20003,
    outcome_var="gdp",
    feature_groups_with_weights={"gdp": 0, "population": 4},
    lag=3,
    top=10,
    time_decay=1.03,
)
f.find_euclidean_kins()
f.plot_weights()

In [20]:
# if you want the full dataframe with distances,
# you still can get this
# it's just boring to print it all over again
# f.find_euclidean_kins()
# display(f.euclidean_kins)

# or, you can plot the few top-ranked locations:
fig = f.show_kins_plot()