distinguish between computer and confirmed variants; new csv file format; csv files moved to separate download
commit 8da47daabb295110e8a52eb87c0b6a341f116971 1 parent e683b7a
Dallan Quass authored
52 README.md
@@ -8,7 +8,7 @@ This readme explains how to incorporate name variants into your own website.
Search module
=============
-The search module contains everything you need to incorporate name variants into
+The search module contains the code you need to incorporate name variants into
your own website.
* _Normalizer.java_ - converts a user-entered text string to a normalized form -
@@ -50,27 +50,39 @@ names in Ancestry's database, are indexed under their Soundex code by
_getAdditionalIndexTokens_. _getAdditionalSearchTokens_ includes these names by
including the Soundex code of the searched-for name as one of the tokens to search.
-The above tables will be updated periodically to incorporate user modifications
+_Searcher.java_ reads the two above tables to determine whether a name must be
+indexed under a Soundex token (if it is not in the table), and what additional
+names to include in searches. If you don't want to use java, you could pretty
+easily write your own code to read the two tables.
+
+The files will be updated periodically to incorporate user modifications
made at the [Variant names project](http://www.werelate.org/wiki/WeRelate:Variant_names_project).
As part of this project, people are being encouraged not only to improve the names,
but also to review the changes made by others. A changes log
(see [Changes log](http://www.werelate.org/wiki/Special:NamesLog)) will be included
here so you can browse the changes.
-_Searcher.java_ reads the two above tables to determine whether a name must be
-indexed under a Soundex token (if it is not in the table), and what additional
-names to include in searches. If you don't want to use java, you could pretty
-easily write your own code to read the two tables.
+Downloading the name-variants files
+-----------------------------------
+
+The _givenname\_similar\_names.csv_ and _surname\_similar\_names.csv_
+name-variants files used to be included directly in the github repository for
+convenience, but due to their size, I've moved them into separate downloadable files.
+__You need to download these files.__
+You can get the latest version of the files [here](https://github.com/DallanQ/Names/wiki/Name-variant-files).
+Download the files, unzip them, and put them on your classpath.
-Installing the tables into a database
--------------------------------------
+Installing the name-variants files into a database
+--------------------------------------------------
-By default, _Searcher.java_ reads the above two tables into memory. If you want to
-read the tables from a database instead of reading them into memory, do the following:
+By default, _Searcher.java_ reads the name-variants files from the classpath into memory.
+Alternatively, you can load them into a database, in which case the files don't need to be on your classpath.
+If you want to load the files into a database instead of reading them into memory, do the following:
create table givenname_similar_names (
name varchar(255) not null,
- similar_names varchar(4096) not null,
+ confirmed_variants varchar(4096) not null,
+ computer_variants varchar(4096) not null,
primary key (name));
mysqlimport --fields-enclosed-by='"' --fields-terminated-by=','
@@ -78,7 +90,8 @@ read the tables from a database instead of reading them into memory, do the foll
create table surname_similar_names (
name varchar(255) not null,
- similar_names varchar(4096) not null,
+ confirmed_variants varchar(4096) not null,
+ computer_variants varchar(4096) not null,
primary key (name));
mysqlimport --fields-enclosed-by='"' --fields-terminated-by=','
@@ -88,7 +101,7 @@ In addition,
* copy c3p0.properties.example to c3p0.properties, customize it as needed, and make sure it is on your classpath
-* copy db\_memcache.properties.example to db\_memcache.properties, customize it as needed, and make sure it is on your classpth
+* copy db\_memcache.properties.example to db\_memcache.properties, customize it as needed, and make sure it is on your classpath
Building
--------
@@ -288,11 +301,18 @@ License
The source code is available under the [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0).
Data files in the resources directories are available under a [Creative Commons Attribution-ShareAlike license](http://creativecommons.org/licenses/by-sa/3.0/).
+See the LICENSE file for details.
-Support
-=======
+Change history
+==============
+
+Mar 2012 - v1.1
+* distinguish between computer variants and confirmed variants
+* new csv file format
+* csv files moved from github project to separate download
-Support is available via the [Google group](https://groups.google.com/group/folg-names)
+Dec 2011 - v1.0
+* initial commit
Roadmap
=======
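The README change above says that non-Java users could write their own reader for the two tables. As a minimal sketch of what that might look like for the new three-column format (`"name","confirmed variants","computer variants"`, with variants separated by spaces), the class below parses a single line. `VariantLine` and its methods are hypothetical names used only for illustration, and the sample names are made up; the splitting rules simply mirror what `readSimilarNames` does in this commit.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the repository: parses one line of the
// name-variants CSV in the format "name","confirmed variants","computer variants".
final class VariantLine {
    final String name;
    final String[] confirmedVariants;
    final String[] computerVariants;

    private VariantLine(String name, String[] confirmedVariants, String[] computerVariants) {
        this.name = name;
        this.confirmedVariants = confirmedVariants;
        this.computerVariants = computerVariants;
    }

    static VariantLine parse(String line) {
        // Assumes names never contain commas, as the Searcher code in this commit does.
        String[] fields = line.split(",");
        String name = unquote(fields[0]);
        String[] confirmed = fields.length >= 2 ? splitNames(fields[1]) : new String[0];
        String[] computer = fields.length >= 3 ? splitNames(fields[2]) : new String[0];
        return new VariantLine(name, confirmed, computer);
    }

    // Strips the surrounding double quotes from a field.
    private static String unquote(String field) {
        return field.length() >= 2 ? field.substring(1, field.length() - 1) : "";
    }

    // Splits a quoted, space-separated list; an empty list yields an empty array.
    private static String[] splitNames(String field) {
        String inner = unquote(field);
        return inner.isEmpty() ? new String[0] : inner.split(" ");
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data files.
        VariantLine v = parse("\"john\",\"jon johnny\",\"jan\"");
        System.out.println(v.name
                + " confirmed=" + Arrays.toString(v.confirmedVariants)
                + " computer=" + Arrays.toString(v.computerVariants));
    }
}
```

Reading a whole file would then be a loop over its lines that stores each parsed entry in a map keyed by `name`.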
2  eval/pom.xml
@@ -4,7 +4,7 @@
<parent>
<groupId>org.folg.names</groupId>
<artifactId>parent</artifactId>
- <version>1.0</version>
+ <version>1.1</version>
</parent>
<artifactId>eval</artifactId>
2  pom.xml
@@ -3,7 +3,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>org.folg.names</groupId>
<artifactId>parent</artifactId>
- <version>1.0</version>
+ <version>1.1</version>
<packaging>pom</packaging>
<name>Name standard</name>
<modules>
2  score/pom.xml
@@ -4,7 +4,7 @@
<parent>
<groupId>org.folg.names</groupId>
<artifactId>parent</artifactId>
- <version>1.0</version>
+ <version>1.1</version>
</parent>
<artifactId>score</artifactId>
2  search/pom.xml
@@ -4,7 +4,7 @@
<parent>
<groupId>org.folg.names</groupId>
<artifactId>parent</artifactId>
- <version>1.0</version>
+ <version>1.1</version>
</parent>
<artifactId>search</artifactId>
149 search/src/main/java/org/folg/names/search/Searcher.java
@@ -46,30 +46,40 @@ public static Searcher getSurnameInstance() {
return surnameStandardizer;
}
+ public static class ConfirmedComputerVariants implements Serializable {
+ public String[] confirmedVariants = null;
+ public String[] computerVariants = null;
+
+ public ConfirmedComputerVariants(String[] confirmedVariants, String[] computerVariants) {
+ this.confirmedVariants = confirmedVariants;
+ this.computerVariants = computerVariants;
+ }
+ }
+
private static ComboPooledDataSource staticDS = null;
private static synchronized DataSource getDataSource(String driverClass, String jdbcUrl, String user, String password) {
- if (staticDS == null) {
- staticDS = new ComboPooledDataSource();
- try {
- Class.forName(driverClass).newInstance();
- staticDS.setDriverClass(driverClass);
- } catch (Exception e) {
- throw new RuntimeException("Error loading database driver: "+e.getMessage());
- }
- staticDS.setJdbcUrl(jdbcUrl);
- staticDS.setUser(user);
- staticDS.setPassword(password);
- Runtime.getRuntime().addShutdownHook(new Thread() {
- public void run() {
- try {
- DataSources.destroy(staticDS);
- } catch (SQLException e) {
- // ignore
- }
- }
- });
- }
- return staticDS;
+ if (staticDS == null) {
+ staticDS = new ComboPooledDataSource();
+ try {
+ Class.forName(driverClass).newInstance();
+ staticDS.setDriverClass(driverClass);
+ } catch (Exception e) {
+ throw new RuntimeException("Error loading database driver: "+e.getMessage());
+ }
+ staticDS.setJdbcUrl(jdbcUrl);
+ staticDS.setUser(user);
+ staticDS.setPassword(password);
+ Runtime.getRuntime().addShutdownHook(new Thread() {
+ public void run() {
+ try {
+ DataSources.destroy(staticDS);
+ } catch (SQLException e) {
+ // ignore
+ }
+ }
+ });
+ }
+ return staticDS;
}
private static class DaemonBinaryConnectionFactory extends BinaryConnectionFactory {
@@ -85,7 +95,7 @@ private static synchronized MemcachedClient getMemcachedClient(String memcacheAd
if (staticMC == null) {
try {
staticMC = new MemcachedClient(new DaemonBinaryConnectionFactory(),
- AddrUtil.getAddresses(memcacheAddresses));
+ AddrUtil.getAddresses(memcacheAddresses));
} catch (IOException e) {
logger.warning("Unable to initialize memcache client");
}
@@ -97,7 +107,7 @@ private static synchronized MemcachedClient getMemcachedClient(String memcacheAd
private final boolean isSurname;
private Map<String,String[]> codeMap = null;
private Set<String> commonNames = null;
- private Map<String,String[]> similarNames = null;
+ private Map<String,ConfirmedComputerVariants> similarNames = null;
private final StringEncoder coder;
private Map<String,String> prefixed2base = null;
private Map<String,List<String>> base2prefixed = null;
@@ -144,9 +154,9 @@ private Searcher(final boolean isSurname) {
if (databaseDriver != null) {
// given and surname Standardizer's share the same dataSource
dataSource = getDataSource(databaseDriver,
- props.getProperty("databaseURL"),
- props.getProperty("databaseUser"),
- props.getProperty("databasePassword"));
+ props.getProperty("databaseURL"),
+ props.getProperty("databaseUser"),
+ props.getProperty("databasePassword"));
// given and surname Standardizer's share the same memcachedClient
String memcacheAddresses = props.getProperty("memcacheAddresses");
@@ -205,25 +215,32 @@ private Searcher(final boolean isSurname) {
*/
public void readSimilarNames(Reader reader) throws IOException {
BufferedReader bufReader = new BufferedReader(reader);
- similarNames = new HashMap<String, String[]>();
+ similarNames = new HashMap<String, ConfirmedComputerVariants>();
String line;
String[] empty = new String[0];
while ((line = bufReader.readLine()) != null) {
- // line is "name","similar names"
+ // line is "name","confirmed variants","computer variants"
String[] fields = line.split(",");
// intern strings so they don't take so much memory; similar names are often repeated
String name = fields[0].substring(1, fields[0].length()-1).intern();
- if (fields.length == 2 && fields[1].length() > 2) {
- String[] names = fields[1].substring(1, fields[1].length()-1).split(" ");
- for (int i = 0; i < names.length; i++) {
- names[i] = names[i].intern();
- }
- similarNames.put(name, names);
+ String[] confirmedVariants = empty;
+ String[] computerVariants = empty;
+ if (fields.length >= 2 && fields[1].length() > 2) {
+ confirmedVariants = getInternedNames(fields[1]);
}
- else {
- similarNames.put(name, empty);
+ if (fields.length == 3 && fields[2].length() > 2) {
+ computerVariants = getInternedNames(fields[2]);
}
+ similarNames.put(name, new ConfirmedComputerVariants(confirmedVariants, computerVariants));
+ }
+ }
+
+ private String[] getInternedNames(String field) {
+ String[] names = field.substring(1, field.length() - 1).split(" ");
+ for (int i = 0; i < names.length; i++) {
+ names[i] = names[i].intern();
}
+ return names;
}
/**
@@ -293,17 +310,20 @@ public void readBasenames(Reader reader) throws IOException {
}
}
- private String[] readSimilarNamesFromDb(String namePiece) {
+ private ConfirmedComputerVariants readSimilarNamesFromDb(String namePiece) {
Connection conn = null;
PreparedStatement stmt = null;
- String[] similarNames = null;
+ ConfirmedComputerVariants ccVariants = null;
try {
conn = dataSource.getConnection();
- stmt = conn.prepareStatement("SELECT similar_names from "+(isSurname ? "surname" : "givenname")+"_similar_names where name=?");
+ stmt = conn.prepareStatement("SELECT confirmed_variants, computer_variants from "+(isSurname ? "surname" : "givenname")+"_similar_names where name=?");
stmt.setString(1, namePiece);
ResultSet rs = stmt.executeQuery();
if (rs != null && rs.next()) {
- similarNames = rs.getString(1).split(" ");
+ String confirmedVariants = rs.getString(1);
+ String computerVariants = rs.getString(2);
+ ccVariants = new ConfirmedComputerVariants(confirmedVariants.length() > 0 ? confirmedVariants.split(" ") : new String[0],
+ computerVariants.length() > 0 ? computerVariants.split(" ") : new String[0]);
}
} catch (SQLException e) {
logger.warning("Error reading from db: "+e.getMessage());
@@ -320,7 +340,7 @@ public void readBasenames(Reader reader) throws IOException {
// ignore
}
}
- return similarNames;
+ return ccVariants;
}
// Returns the base of this surname if it starts with a probable prefix
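The `length() > 0` checks added to `readSimilarNamesFromDb` matter because `String.split` on an empty string returns a one-element array holding the empty string, not an empty array. A minimal sketch of that behavior follows; `EmptySplitDemo` and `splitVariants` are hypothetical names for illustration, not code from the repository.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the repository.
final class EmptySplitDemo {
    // Mirrors the guard used in this commit: an empty column yields an empty array.
    static String[] splitVariants(String column) {
        return column.length() > 0 ? column.split(" ") : new String[0];
    }

    public static void main(String[] args) {
        // Unguarded: splitting "" gives a single element, the empty string itself.
        System.out.println("".split(" ").length);                          // prints 1
        // Guarded: an empty column produces no variants at all.
        System.out.println(splitVariants("").length);                      // prints 0
        System.out.println(Arrays.toString(splitVariants("jon johnny")));  // prints [jon, johnny]
    }
}
```

Without the guard, a row with an empty variants column would contribute an empty-string token to the search.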
@@ -392,57 +412,62 @@ public String getCode(String namePiece) {
return "";
}
- private void addSimilarNames(String namePiece, Collection<String> tokens) {
- String[] names = null;
+ /**
+ * Normally you should call getAdditionalSearchTokens.
+ * This function is called only if you want to distinguish between user-confirmed variants and computer-generated variants
+ * @param namePiece normalized name piece
+ * @return confirmed and computer variants
+ */
+ public ConfirmedComputerVariants getConfirmedComputerVariants(String namePiece) {
+ ConfirmedComputerVariants ccVariants = null;
boolean memcacheLookupFailed = false;
// if we read similar names from a file, look up there first
if (similarNames != null) {
- names = similarNames.get(namePiece);
+ ccVariants = similarNames.get(namePiece);
}
else {
// try the cache if we have one
if (memcachedClient != null) {
- names = (String[]) memcachedClient.get(memcacheKeyPrefix+namePiece);
- if (names == null) {
+ ccVariants = (ConfirmedComputerVariants) memcachedClient.get(memcacheKeyPrefix+namePiece);
+ if (ccVariants == null) {
memcacheLookupFailed = true;
}
}
// try the database
- if (names == null) {
- names = readSimilarNamesFromDb(namePiece);
+ if (ccVariants == null) {
+ ccVariants = readSimilarNamesFromDb(namePiece);
}
}
// if all else fails, get similar names from soundex code map
- if (names == null) {
+ if (ccVariants == null) {
+ String[] computerVariants = null;
try {
- names = codeMap.get(coder.encode(namePiece));
+ computerVariants = codeMap.get(coder.encode(namePiece));
} catch (EncoderException e) {
logger.warning("Error encoding: "+namePiece);
}
- if (names == null) {
- names = new String[0];
- }
+ ccVariants = new ConfirmedComputerVariants(new String[0], computerVariants != null ? computerVariants : new String[0]);
}
if (memcacheLookupFailed) {
- memcachedClient.set(memcacheKeyPrefix+namePiece, memcacheExpiration, names);
+ memcachedClient.set(memcacheKeyPrefix+namePiece, memcacheExpiration, ccVariants);
}
- Collections.addAll(tokens, names);
+ return ccVariants;
}
/**
* Return the set of similar names for a name piece
- * You don't normally call this function. Call getAdditionalSearch tokens to also include basenames and soundex tokens
+ * You don't normally call this function. Call getAdditionalSearchTokens to also include basenames and soundex tokens
* @param namePiece normalized name piece
* @return similar names
*/
public Collection<String> getSimilarNames(String namePiece) {
Collection<String> tokens = new HashSet<String>();
- addSimilarNames(namePiece, tokens);
+ addVariants(namePiece, tokens);
return tokens;
}
@@ -460,7 +485,13 @@ private void addSearchTokens(String namePiece, Collection<String> tokens, boolea
}
// include similar names (and codes)
- addSimilarNames(namePiece, tokens);
+ addVariants(namePiece, tokens);
+ }
+
+ private void addVariants(String namePiece, Collection<String> tokens) {
+ ConfirmedComputerVariants ccVariants = getConfirmedComputerVariants(namePiece);
+ tokens.addAll(Arrays.asList(ccVariants.confirmedVariants));
+ tokens.addAll(Arrays.asList(ccVariants.computerVariants));
}
public String getBasename(String namePiece) {
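The new `getConfirmedComputerVariants` method lets a caller treat the two lists differently, while `addVariants` merges both into one token collection. The sketch below illustrates both options under stated assumptions: `VariantPair` is a stand-in for `Searcher.ConfirmedComputerVariants` so the example compiles on its own, `VariantTokensDemo` and its methods are hypothetical, and the names are made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

// Stand-in for Searcher.ConfirmedComputerVariants (assumption: same two public fields).
final class VariantPair {
    final String[] confirmedVariants;
    final String[] computerVariants;

    VariantPair(String[] confirmedVariants, String[] computerVariants) {
        this.confirmedVariants = confirmedVariants;
        this.computerVariants = computerVariants;
    }
}

// Hypothetical demo class, not part of the repository.
final class VariantTokensDemo {
    // Same merge as addVariants: both lists go into one collection.
    // With a HashSet, a name present in both lists is kept only once.
    static Collection<String> allTokens(VariantPair v) {
        Collection<String> tokens = new HashSet<String>();
        tokens.addAll(Arrays.asList(v.confirmedVariants));
        tokens.addAll(Arrays.asList(v.computerVariants));
        return tokens;
    }

    // A stricter caller could keep only the user-confirmed list.
    static Collection<String> confirmedOnly(VariantPair v) {
        return new HashSet<String>(Arrays.asList(v.confirmedVariants));
    }

    public static void main(String[] args) {
        VariantPair v = new VariantPair(new String[]{"jon", "johnny"}, new String[]{"jon", "jan"});
        System.out.println(allTokens(v).size());      // prints 3
        System.out.println(confirmedOnly(v).size());  // prints 2
    }
}
```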
70,000 search/src/main/resources/givenname_similar_names.csv
0 additions, 70,000 deletions not shown
200,000 search/src/main/resources/surname_similar_names.csv
0 additions, 200,000 deletions not shown
2  service/pom.xml
@@ -4,7 +4,7 @@
<parent>
<groupId>org.folg.names</groupId>
<artifactId>parent</artifactId>
- <version>1.0</version>
+ <version>1.1</version>
</parent>
<artifactId>service</artifactId>