The purpose of this project is to create a comprehensive database of name variants to include in searches whenever a particular name is searched. It is a better name-matcher than Soundex. Read more about the Variant names project
This readme explains how to incorporate name variants into your own website.
The search module contains the code you need to incorporate name variants into your own website.
Normalizer.java - converts a user-entered text string to a normalized form - converts non-roman characters to corresponding roman characters, removes noise words, prepends surname prefixes, converts -dotter endings to -son, etc. You must call this on user-entered text before calling any other function. See NormalizerTest.java for an example.
Searcher.java - the two main functions are getAdditionalIndexTokens and getAdditionalSearchTokens. See SearcherTest.java for an example.
Each name should be indexed under its normalized form as well as any tokens returned by getAdditionalIndexTokens. getAdditionalIndexTokens returns the soundex code for the name when the name rare. It does not return any tokens otherwise. Thus, most names require just a single token (the normalized name itself), while some names require two tokens.
To return variants for a user-specified name at search time, search for the normalized form of the user-specified name as well as additional tokens returned by getAdditionalSearchTokens. On average, getAdditionalSearchTokens will return 25-35 tokens to include in searches.
surnamePrefixedNames.txt contains a list of prefixed-surnames (e.g., McDonald), and their unprefixed roots (e.g., Donald). According to the labeled pairs provided by Ancestry, unprefixed roots need to be included in searches for a prefixed surname, and vice-versa. But determining whether a name is prefixed is not easy. That is, Vandyke and Ohare are prefixed, but Vance and Olson are not. This table attempts to identify common prefixed surnames. Rare surnames use the prefix list in name-searcher.properties to estimate whether they are prefixed.
givenname_similar_names.csv is a table of variants for the 70,000 most-frequent given names in Ancestry.com's database.
surname_similar_names.csv is a table of variants for the 200,000 most-frequent surnames in Ancestry.com's database.
Rare names not in the above tables appear less than once in every 5,000,000 names in Ancestry's database, are indexed under their Soundex code by getAdditionalIndexTokens. getAdditionalSearchTokens includes these names by including the Soundex code of the searched-for name as one of the tokens to search.
Searcher.java reads the two above tables to determine whether a name must be indexed under a Soundex token (if it is not in the table), and what additional names to include in searches. If you don't want to use java, you could pretty easily write your own code to read the two tables.
The files will be updated periodically to incorporate user modifications made at the Variant names project. As part of this project, people are being encouraged not only to improve the names, but also to review the changes made by others. A changes log (see Changes log) will be included here so you can browse the changes.
The givenname_similar_names.csv and surname_similar_names.csv name-variants files used to be included directly in the github repository for convenience, but due to their size, I've moved them into separate downloadable files. You need to download these files. You can get the latest version of the files here. Download the files, unzip them, and put them on your classpath.
By default, Searcher.java reads the name-variants files from the classpath into memory. Alternatively, you can load them into a database, in which case the files don't need to be on your classpath. If you want to load the files into a database instead of reading them into memory, do the following:
create table givenname_similar_names ( name varchar(255) not null, confirmed_variants varchar(4096) not null, computer_variants varchar(4096) not null, primary key (name)); mysqlimport --fields-enclosed-by='"' --fields-terminated-by=',' --lines-terminated-by="\\n" --local <dbname> givenname_similar_names.csv create table surname_similar_names ( name varchar(255) not null, confirmed_variants varchar(4096) not null, computer_variants varchar(4096) not null, primary key (name)); mysqlimport --fields-enclosed-by='"' --fields-terminated-by=',' --lines-terminated-by="\\n" --local <dbname> surname_similar_names.csv
copy c3p0.properties.example to c3p0.properties, customize it as needed, and make sure it is on your classpath
copy db_memcache.properties.example to db_memcache.properties, customize it as needed, and make sure it is on your classpath
mvn install creates the normal jar file as well as one with all dependencies
The score module contains code to score a pair of names: return a number indicating how similar they are.
Scorer.java - contains the scoring function. See ScorerTest.java for an example of use. Remember to normalize the names beforehand.
DMSoundex.java - a simplified implementation of Daitch-Mokotov soundex
Nysiis.java - an implementation of NYSIIS
WeightedEditDistance.java - an edit distance algorithm, like Levenstein, but where each edit is assigned a weight, where the weights have been learned using an Expectation Maximization algorithm over the positive labled examples provided by Ancestry. By weighting edits, we can make the edit distance between ACE and APE greater than the edit distance between ACE and ASE.
FeaturesGenerator.java - when name pairs are scored, a set of features is generated for each pair. These features include whether the NYSIIS codes match, the Soundex codes match, the Refined Soundex codes match, the Daitch-Mokotov codes match, the value of the Levenstein distance, and the value of the Weighted Edit Distance. The weights for each feature were learned by running the labeled training data provided by Ancestry.com through Weka A number of other features were evaluated, but these features seemed to provide the best results.
SimilarNameGenerator.java returns a set of names similar to the specified name. See SimilarNameGeneratorTest.java for an example.
givennameClusters.txt and surnameClusters.txt - these files can be used to speed up SimilarNameGenerator. By clustering the names, rather than scoring each name against the specified name to determine if it is similar, we can score the cluster "centroids", and score the names in a cluster only if the centroid is closer than a certain threshold.
cmulex_lts.bin - this file was created by people at Carnegie Mellon University, and is used by the FreeTTS library to convert name strings to phonetic values. The phonetic values are used in WeightedEditDistance - it calculates the distance between the phonemes.
You might be tempted to index each non-rare name under its cluster centroid, so that you'd only have one token to look up at search time. You could do this, and you'll get reasonable results. But you'll get better results using the approach described in the search module. The reason is that while name X might be similar to name Y, and name Y might be similar to name Z, name X may not be very similar to name Z. In other words, similarity is reflexive but not transitive. Attempting to cluster names and searching on cluster centroids only, assumes transitivity, which doesn't exist.
Before building Scorer.java you must install the customized freetts.jar and William Cohen's secondstring library from the external directory into your local maven repository:
mvn install:install-file -Dfile=external/freetts.jar -DgroupId=com.sun.speech -DartifactId=freetts -Dversion=1.2.2-threadsafe -Dpackaging=jar mvn install:install-file -Dfile=external/secondstring.jar -DgroupId=com.wcohen.ss -DartifactId=secondstring -Dversion=20101021 -Dpackaging=jar
mvn install creates the normal jar file as well as one with all dependencies
Use this module if you want to evaluate the precision and recall of your own coder or name variants tables against Ancestry's labeled pairs.
CodeEvaluator.java shows how to evaluate a code-based approach; modify this to incorporate your coder.
TableEvaluator.java shows how to evaluate a table-based approach.
SimilarNameAdder.java is used to generate a core name-variants file based upon names scoring above a certain threshold.
SimilarNameAugmenter.java adds additional names to a name-variants file. If you had your own custom list of name variants, they could be added to the core name-variants file using this function. '''If you do have your own list of name variants, we hope you share them with us.'''
AncestryGivennamePairs.csv, AncestrySurnamePairs.csv, and BorderSurnamePairs.csv contain the labeled pairs provided by Ancestry. However, BorderSurnamePairs was found to not help, so it was not used.
givenname_ancestry.txt and surname_ancestry.txt contain the positively-labeled pairs from Ancestry's labeled data.
givenname_behindthename.txt contains a set of related givennames graciously donated by BehindTheName.com, an excellent resource for understanding how given names are related. In particular, the file lists all given names that are one "hop" away from the specified name in the "Family tree" view at BehindTheName, are of the same gender, and are not an ancient or medieval name.
givenname_nicknames.txt contains a set of nicknames.
givenname_werelate.txt and surname_werelate.txt contain name pairs given by users at WeRelate.org, the book "Dunkling, Leslie and William Gosling, The New American Dictionary of Baby Names, published by Signet (New American Library), 1985", and the book "Patrick Hanks and Flavia Hodges, A Dictionary of Surnames, Oxford University Press, 1990."
givenname_orig.csv and surname_orig.csv contain the original core of the names-variants tables.
To create the name-variants tables in the search module, first, SimilarNameAdder was used to add all names scoring above a threshold of 2.3 to each of the 70,000 most-frequent givennames and above a threshold of 0.7 to each of the 200,000 most-frequent surnames. For performance reasons, SimilarNameAdder adds considers only names greater than the specified name, so SimilarNameAugmenter was used with the -p option to add the reverse pairs.
The result is stored in givenname_orig.csv and surname_orig.csv.
At this point, the precision and recall of the name-variants tables against Ancestry's labeled pairs were:
- Given names: P/R=97.4/71.6
- Surnames: P/R=89.4/76.4
Compared to Soundex:
- Given names: P/R=97.2/64.6
- Surnames: P/R=88.6/67.8
This represents a 20% decrease in false-negatives (names that should be searched together but are not) for given names, and a 27% decrease in false-negatives for surnames.
Next, givenname_behindthename.txt\, givenname_nicknames.txt, and givenname_werelate.txt were added to the given names table, and surname_werelate.txt was added to the surnames table, using SimilarNameAugmenter from the Score module. The options passed to SimilarNameAugmenter were:
- behindthename: -p -r
- nicknames: -a
werelate: -p At this point, the Precision and Recall of the name-variants tables against Ancestry's labeled pairs were:
Given names: P/R=96.8/74.4
- Surnames: P/R=89.2/76.8
So we end up with a 28% decrease in false-negatives for both givennames and surnames.
Finally, we added the positively-labeled training pairs: givenname_ancestry.txt and surname_ancestry.txt so the initial tables would be as complete as possible to create the tables found in the Search module. User-modifications to these tables as part of the Variant names project will further modify, and hopefully improve, these tables.
If you want to re-create the P/R numbers above, you won't be able to test using the name-variants files in the Search module, since they include the training data. Instead, you'll need to start with the givenname_similar_names_orig.csv or surname_similar_names_orig.csv file in this module and follow the process outlined above. Then run TableEvaluator.java with the "-t" option to pass in the table you just created.
Also, if you use TableEvaluator.java or CodeEvaluator.java with AncestryGivennamePairs.csv or AncestrySurnamePairs.csv, you'll see some warnings about invalid lines. This is to be expected. Some lines in the training data contain invalid names, like "nn", which is an abbrevation for "no name," not an actual name. The evaluators recognize this, skip over the line, and list it as a warning.
Use this module if you want to generate search and index tokens or score names by issuing REST calls. This module allows you to use the Variant names project as a REST service instead of as a library.
The service endpoints are:
- index/type/name returns all tokens to index for the name.
- search/type/name returns all tokens to search for the name.
- score/type/name1/name2 returns the score of name1 and name2
In all cases, type is either "givenname" or "surname". You can request the result as XML or JSON by requesting a content type of "application/xml" or "application/json".
mvn package creates a war file that you can deploy to a servlet container like Tomcat
The source code is available under the Apache 2.0 license. Data files in the resources directories are available under a Creative Commons Attribution-ShareAlike license. See the LICENSE file for details.
Mar 2012 - v1.1
- distinguish between computer variants and confirmed variants
- new csv file format
- csv files moved from github project to separate download
Dec 2011 - v1.0
- initial commit
Overall, given names had an average of 32 variants, and surnames had an average of 26 variants. For search engines like Lucene, that's not too bad, because each variant lookup results in the union of a generally-small result set. However, the most-frequent 1000 given names and surnames had an average of 61 and 51 variants respectively, which means that the most-frequently searched names require a lot of extra lookups.
We can do better.
One thing that can be done to reduce the number of extra lookups necessary when searching for variants is to cluster the variants into lots of small clusters. Then instead of looking up each variant separately, Searcher.getAdditionalIndexTokens would return the cluster code for each name, and Searcher.getAdditionalSearchTokens would return the distinct cluster codes for each of the variants. By clustering the names according to how frequently they appear together as variants, we would expect the number of distinct clusters for each set of variants to be much smaller than the number of variants themselves.
Another approach would be to index names using a high-precision, low-recall coder like Refined Soundex. Then at search time look up each of the distinct refined soundex codes for the variants.
I'd love some help tackling this problem. Mahout looks like a good tool to generate clusters.
Check out other genealogy projects