### CS 838 &mdash; Data Science: Principles, Algorithms, and Applications; Spring 2017 ###

#  Stage 3: entity matching #

#### Trang Ho, Thomas Ngo, Qinyuan Sun

*****

### Table of Contents

1. [Introduction](#1.Introduction)
1. [Dataset](#2.Dataset)
1. [Training](#3.Training)
1. [Links](#4.Links)


## 1. Introduction ##

In this project stage, our team matched entities between two tables of educational affiliations. The first table was extracted from the Academy of Management Conference (AOM) website and contains personal information on the 2014 conference attendees: (1) individual name, (2) affiliation name, (3) country, (4) state/province, (5) city, (6) contact numbers, and (7) email address. Overall, this table consists of 9,532 entities at the individual level.

The second table was extracted from the World Higher Education Database (WHED), which contains information on unique educational affiliations worldwide. This table provides (1) affiliation name, (2) country, (3) street, (4) city, (5) state/province, (6) postal code, (7) telephone number, and (8) website address (if available). Overall, this table consists of 17,605 unique entities at the affiliation level.

To match individuals' affiliations in the first table to affiliations in the second table, we used the attributes that overlap or are otherwise relevant: (1) affiliation name, (2) country, (3) state/province, (4) city, (5) website address, and (6) individual email address. Our goal is a precision above 95% with recall as high as possible.

Subsequently, we carried out the following steps using Magellan:
* Pre-processing
* Down-sizing the AOM table and the WHED table
* Using a blocker to reduce the size of the potential-candidate set
* Randomly sampling 500 candidate pairs for labeling
* Creating training and testing sets I and J
* Training and selecting the best classifier using cross-validation

More details can be found below.

### Step 1. Pre-processing
In this step, we cleaned the two datasets by standardizing affiliation names, country, state/province, city, and email server domain. For example, we standardized states by transforming "CA", "CA - California", and "California" to "california" in both the AOM table and the WHED table.
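
The state standardization described above can be sketched as follows. This is a minimal illustration, not the project's actual cleaning code: the function name `normalize_state` and the lookup table `STATE_NAMES` are hypothetical, and only the California variants come from the text.

```python
import re

# Hypothetical abbreviation table; only the "ca" entry is taken from the report.
STATE_NAMES = {
    "ca": "california",
    "ny": "new york",
    "wi": "wisconsin",
}

def normalize_state(raw):
    """Map variants like 'CA', 'CA - California', 'California' to one lowercase name."""
    if raw is None:
        return ""
    text = raw.strip().lower()
    # For values such as "ca - california", keep the part after the dash.
    parts = [p.strip() for p in re.split(r"\s+-\s+", text) if p.strip()]
    if not parts:
        return ""
    candidate = parts[-1]
    # Expand a bare abbreviation; otherwise keep the lowercase name as-is.
    return STATE_NAMES.get(candidate, candidate)
```

Applying the same function to both tables keeps later equality checks on state consistent.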

### Step 2. Down-sizing
Initially, we had 9,532 entities in the AOM table and 17,605 entities in the WHED table. After down-sizing, we had 4,000 AOM entities and 4,962 WHED entities.

### Step 3. Blocking
Our blocking consists of the following components:
* Blocking all tuple pairs that have different countries
* For American affiliations, blocking all tuple pairs that have different states/provinces
* For all affiliations, blocking all tuple pairs that have neither (1) any overlap between the AOM email domain and the WHED affiliation website domain nor (2) a sufficient overlap coefficient (i.e., greater than 0.5) between affiliation names

As a result, we reduced the size of our candidate set from 19,848,000 (= 4,000 x 4,962) to 126,516, about 0.64% of the full cross product.
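
The three blocking rules can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the Magellan blocker we ran: the record keys (`country`, `state`, `affiliation`, `email`, `website`), the whitespace tokenization, the `"united states"` label, and the way domains are compared are all hypothetical choices.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def email_domain(email):
    """Return the part after '@', or '' if there is none."""
    return email.rsplit("@", 1)[-1].lower() if "@" in email else ""

def website_domain(url):
    """Strip the scheme, the path, and a leading 'www.' from a URL."""
    host = url.lower().split("://")[-1].split("/")[0]
    return host[4:] if host.startswith("www.") else host

def domains_overlap(email, url):
    """Assumed notion of overlap: equal domains or one is a subdomain of the other."""
    e, w = email_domain(email), website_domain(url)
    if not e or not w:
        return False
    return e == w or e.endswith("." + w) or w.endswith("." + e)

def is_blocked(aom, whed, name_threshold=0.5, us_label="united states"):
    """Return True when the pair should be dropped from the candidate set."""
    if aom["country"] != whed["country"]:
        return True  # rule 1: different countries
    if aom["country"] == us_label and aom["state"] != whed["state"]:
        return True  # rule 2: American affiliations in different states
    name_ok = overlap_coefficient(aom["affiliation"], whed["affiliation"]) > name_threshold
    domain_ok = domains_overlap(aom["email"], whed["website"])
    return not (name_ok or domain_ok)  # rule 3: neither signal present
```

A pair survives blocking only if it passes the country and state checks and shows at least one of the two similarity signals.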

### Step 4. Sampling for labelling
We initially sampled 500 tuple pairs at random from the set of 126,516 potential candidates. After labeling, we dropped 22 pairs because the AOM information was ambiguous. Consequently, we had 478 labeled tuple pairs with a density (share of matching pairs) of approximately 34%.

### Step 5. Creating training & testing sets
We split the sample set evenly into a training set I and a testing set J. As a result, each set has 239 tuple pairs.
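
An even split of this kind can be sketched with the standard library alone. The function name `split_half` and the fixed seed are assumptions for illustration; the actual split was done with Magellan.

```python
import random

def split_half(pairs, seed=0):
    """Shuffle the labeled pairs and cut them into two halves (I, J)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# 478 labeled pairs give two sets of 239 pairs each.
set_i, set_j = split_half(range(478))
```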

### Step 6. Training and selecting the best classifiers
We used six learning methods for training on set I with cross-validation: (1) Decision Tree, (2) Random Forest, (3) SVM, (4) Naive Bayes, (5) Logistic Regression, and (6) Linear Regression. Below is the first-attempt accuracy performance of our classifiers:





## 2.Dataset ##

We set aside 100 text documents as the [test set](https://github.com/TrangHo/cs838-code/tree/master/test-examples) to generate testing examples and used the remaining text documents as the [train set](https://github.com/TrangHo/cs838-code/tree/master/train-texts) to generate training examples.

|                | Num. of documents| Num. of positive examples  | Num. of negative examples|
| -------------  |:----------------:| :-------------:            | :-------------:          |
| Training Set I | 200              |     725                    |  1948                    |
| Testing Set J  | 100              |     359                    |   898                    |
| Total          | 300              |     1084                   |  2846                    |



Subsequently, we used [four main regular-expression patterns](https://github.com/TrangHo/cs838-code/blob/master/src/lib/constants/patterns.py) to create a pool of potential negative-example candidates. The patterns suggest the following characteristics of negative candidates:

- having at least 2 words, all of which are capitalized
- having 2 capitalized words with a prefix of at/from/in
- consisting of 3 or 4 words with a suffix noun that usually goes with universities, such as professor/student/etc.
- consisting of 3 or words with a prefix verb that usually goes with universities, such as attend/receive

The final negative examples were then randomly selected from the pool. 
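
As a rough illustration of how such a candidate pool can be built and sampled, the sketch below implements only the first two characteristics. The regular expressions, `candidate_pool`, and `sample_negatives` are hypothetical stand-ins; the patterns we actually used are in the linked `patterns.py`.

```python
import random
import re

CAP = r"[A-Z][a-z]+"  # one capitalized word (simplified assumption)

PATTERNS = [
    # two or more consecutive capitalized words
    re.compile(rf"\b({CAP}(?:\s+{CAP})+)\b"),
    # two capitalized words preceded by at/from/in
    re.compile(rf"\b(?:at|from|in)\s+({CAP}\s+{CAP})\b"),
]

def candidate_pool(text):
    """Collect every distinct string matched by any pattern."""
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.add(match.group(1))
    return sorted(found)

def sample_negatives(pool, k, seed=0):
    """Randomly pick up to k candidates from the pool."""
    return random.Random(seed).sample(pool, min(k, len(pool)))

pool = candidate_pool("She studied in New York before moving to Los Angeles.")
```

In the real pipeline, strings already labeled as positive examples would also need to be excluded from the pool before sampling.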




## 3.Training ##

To generate feature vectors from the positive and negative examples, we eventually designed 17 functions that (1) take a string and its surrounding texts, and (2) output either zero or one. Therefore our feature vector has 17 dimensions.
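
The shape of these feature functions can be sketched as below. The three functions shown are hypothetical examples, not our actual 17 features: each takes a candidate string plus the text before and after it and returns 0 or 1. The short-name lookup uses the three university names mentioned in this section.

```python
# Hypothetical short-name dictionary; entries are the examples named in this report.
SHORT_NAMES = {"yale", "stanford", "columbia"}

def has_university_keyword(candidate, before, after):
    """1 if the candidate contains a typical institution keyword."""
    lowered = candidate.lower()
    return int(any(word in lowered for word in ("university", "college", "institute")))

def preceded_by_preposition(candidate, before, after):
    """1 if the word right before the candidate is at/from/in."""
    words = before.split()
    return int(bool(words) and words[-1].lower() in {"at", "from", "in"})

def is_known_short_name(candidate, before, after):
    """1 if the candidate is a single-word name found in the dictionary."""
    return int(candidate.strip().lower() in SHORT_NAMES)

FEATURES = [has_university_keyword, preceded_by_preposition, is_known_short_name]

def feature_vector(candidate, before, after):
    """Apply every feature function; the vector has one dimension per function."""
    return [feature(candidate, before, after) for feature in FEATURES]

vector = feature_vector("Yale", "He graduated from", "in 2010.")
```

With 17 such functions in `FEATURES`, the same `feature_vector` call yields the 17-dimensional binary vector described above.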

The machine learning algorithms we employed are: 
- Support vector machine
- Decision tree
- Random forest
- Linear regression
- Logistic regression 
- Multilayer perceptron neural network


We initially had only 16 features. The average precision and recall from 5-fold cross-validation are listed below. The results of our classifiers were close to, but did not meet, the requirement of (1) precision of 90% or higher and (2) recall of 50% or higher. After inspecting the false positives and false negatives, we found that a prevalent problem was that single-word university names (such as Yale, Stanford, and Columbia) were wrongly classified as negatives. We therefore added a dictionary of short names for popular universities as feature 17 to handle these cases. This feature significantly increased both the precision and the recall of all classifiers.

__Precision & Recall with 16 Features__

| Machine Learning Algorithm| Ave CV Precision | Ave CV Recall  |     F1    |
| ------------------------- |:----------------:| :-------------:|:---------:|
| Support Vector Machine    | 0.92             |     0.49       | 0.64      |
| Decision Tree             | 0.89             |     0.54       | 0.67      | 
| Random Forest             | 0.89             |     0.54       | 0.67      | 
| Logistic Regression       | 0.90             |     0.50       | 0.64      | 
| Neural Network            | 0.88             |     0.54       | 0.67      | 

__Precision & Recall with 17 Features__

| Machine Learning Algorithm| Ave CV Precision | Ave CV Recall  |     F1    |
| ------------------------- |:----------------:| :-------------:|:---------:|
| Support Vector Machine    | 0.95             |     0.70       | 0.81      |
| Decision Tree             | 0.93             |     0.72       | 0.81      |
| Random Forest             | 0.92             |     0.74       | 0.82      |
| Linear Regression         | 0.97             |     0.67       | 0.79      |
| Logistic Regression       | 0.95             |     0.70       | 0.81      |
| Neural Network            | 0.92             |     0.73       | 0.81      |

We chose Support Vector Machine as our classifier. We trained the classifier with all the training examples and tested on the testing examples. The results are shown in the following table.

|Type  |Precision| Recall |F1  |
| ---- |:-------:|:------:|:--:|
|TRAIN |0.93     | 0.73   |0.82|
|TEST  |0.97     | 0.72   |0.83|
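
The F1 column in the tables of this section is the harmonic mean of precision and recall. A minimal sketch for checking the values (the helper name `f1_score` is ours for illustration, not a library call):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values from the TRAIN and TEST rows above.
train_f1 = round(f1_score(0.93, 0.73), 2)
test_f1 = round(f1_score(0.97, 0.72), 2)
```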






## 4.Links ##

[link](https://github.com/TrangHo/cs838-code/tree/master/texts) to the 300 text documents

[link](https://github.com/TrangHo/cs838-code/tree/master/train-texts) to training set

[link](https://github.com/TrangHo/cs838-code/tree/master/test-examples) to test set

[link](https://github.com/TrangHo/cs838-code/tree/master/src) to source code

[link](https://github.com/TrangHo/cs838-spring2017/raw/master/cs838-stage2.zip) to a zip file for stage 2 related documents

