# Getting started with DeepMatcher

Note: you can run **[this notebook live in Google Colab](https://colab.research.google.com/github/anhaidgroup/deepmatcher/blob/master/examples/getting_started.ipynb)** and use free GPUs provided by Google.

This tutorial describes how to effortlessly perform entity matching using deep neural networks. Specifically, we will see how to match pairs of tuples (also called data records or table rows) to determine if they refer to the same real-world entity. To do so, we will need labeled examples as input, i.e., tuple pairs which have been annotated as matches or non-matches. These labeled examples will be used to train our neural network using supervised learning. At the end of this tutorial, you will have a trained neural network as output which you can easily apply to unlabeled tuple pairs to make predictions.

As an overview, here are the steps to use `deepmatcher` that we will go through in this tutorial, starting with setup as step 0:

<ol start="0">
  <li>Setup</li>
  <li>Process labeled data</li>
  <li>Define neural network model</li>
  <li>Train model</li>
  <li>Apply model to new data</li>
</ol>

Let's begin!

## Step 0. Setup

If you are running this notebook inside Colab, you will first need to install necessary packages by running the code below:

In [1]:
try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

  Building wheel for deepmatcher (setup.py) ... done
  Building wheel for fasttextmirror (setup.py) ... error
  ERROR: Failed building wheel for fasttextmirror
    Running setup.py install for fasttextmirror ... done


Now let's import `deepmatcher`, which will do all the heavy lifting to build and train neural network models for entity matching.

In [2]:
import deepmatcher as dm

We recommend having a GPU available for the training in Step 3. If a GPU is not available, `deepmatcher` will use all available CPU cores. You can run the following command to determine if a GPU is available and will be used for training:

In [3]:
import torch
torch.cuda.is_available()

True

### Download sample data for entity matching

Now let's get some sample data to play with in this tutorial. We will need three sets of labeled data and one set of unlabeled data:

1. **Training Data:** This is used for training our neural network model.
2. **Validation Data:** This is used for determining the configuration (i.e., hyperparameters) of our model in such a way that the model does not overfit to the training set.
3. **Test Data:** This is used to estimate the performance of our trained model on unlabeled data.
4. **Unlabeled Data:** The trained model is applied on this data to obtain predictions, which can then be used for downstream tasks in practical application scenarios.

We download these four data sets to the `sample_data` directory:

In [27]:
!mkdir -p sample_data/itunes-amazon
!wget --no-check-certificate -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/train.csv
!wget --no-check-certificate -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/validation.csv
!wget --no-check-certificate -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/test.csv
!wget --no-check-certificate -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/unlabeled.csv


To get an idea of what our data looks like, let's take a peek at the training dataset:

In [5]:
import pandas as pd
pd.read_csv('sample_data/itunes-amazon/train.csv').head()

Unnamed: 0,id,label,left_Song_Name,left_Artist_Name,left_Album_Name,left_Genre,left_Price,left_CopyRight,left_Time,left_Released,right_Song_Name,right_Artist_Name,right_Album_Name,right_Genre,right_Price,right_CopyRight,right_Time,right_Released
0,448,0,Baby When the Light ( David Guetta & Fred Rist...,David Guetta,Pop Life ( Extended Version ) [ Bonus Version ],"Dance , Music , Rock , Pop , House , Electroni...",$ 1.29,‰ ãÑ 2007 Gum Records,6:17,18-Sep-07,Revolver ( Madonna Vs. David Guetta Feat . Lil...,David Guetta,One Love ( Deluxe Version ),Dance & Electronic,$ 1.29,( C ) 2014 Swedish House Mafia Holdings Ltd ( ...,3:18,"August 21 , 2009"
1,287,1,Outversion,Mark Ronson,Version,"Pop , Music , R&B / Soul,Soul,Dance,Rock,Jazz,...",$ 0.99,2007 Mark Ronson under exclusive license to SO...,1:50,10-Jul-07,Outversion,Mark Ronson,Version [ Explicit ],Pop,$ 0.99,( c ) 2011 J'adore Records,1:50,"July 10 , 2007"
2,534,0,Peer Pressure ( feat . Traci Nelson ),Snoop Dogg,Doggumentary,"Hip-Hop/Rap , Music , Rock , Gangsta Rap , Wes...",$ 1.29,"‰ ãÑ 2011 Capitol Records , LLC . All rights r...",4:07,29-Mar-11,Boom ( ( Feat . T-Pain ) [ Edited ] ),Snoop Dogg,Doggumentary [ Edited ],"Rap & Hip-Hop , West Coast",$ 1.29,"( C ) 2011 Capitol Records , LLC",3:50,"March 29 , 2011"
3,181,1,Stars Come Out ( Tim Mason Remix ),Zedd,Stars Come Out ( Remixes ) - EP,"Dance , Music , Electronic , House",$ 1.29,2012 Dim Mak Inc.,5:49,20-May-14,Stars Come Out ( Dillon Francis Remix ),Zedd,Stars Come Out [ Dillon Francis Remix ],Dance & Electronic,$ 1.29,2012 Dim Mak Inc.,4:08,"May 20 , 2014"
4,485,0,Jump ( feat . Nelly Furtado ),Flo Rida,R.O.O.T.S. ( Deluxe Version ),"Hip-Hop/Rap , Music",$ 1.29,‰ ãÑ 2009 Atlantic Recording Corporation for t...,3:28,30-Mar-09,"Yayo [ Feat . Brisco , Billy Blue , Ball Greez...",Flo Rida,R.O.O.T.S. ( Route Of Overcoming The Struggle ...,Rap & Hip-Hop,$ 1.29,"( C ) 2012 Motown Records , a Division of UMG ...",7:53,"March 30 , 2009"


## Step 1. Process labeled data

Before we can use our data for training, `deepmatcher` needs to first load and process it in order to prepare it for neural network training. Currently `deepmatcher` only supports processing CSV files. Each CSV file is assumed to have the following kinds of columns:

* **"Left" attributes (required):** Our goal is to match tuple pairs. "Left" attributes are columns that correspond to the "left" tuple or the first tuple in the tuple pair. These column names are expected to be prefixed with "left_" by default.
* **"Right" attributes (required):** "Right" attributes are columns that correspond to the "right" tuple or the second tuple in the tuple pair. These column names are expected to be prefixed with "right_" by default.
* **Label column (required for train, validation, test):** Column containing the labels (match or non-match) for each tuple pair. Expected to be named "label" by default.
* **ID column (required):** Column containing a unique ID for each tuple pair. This is for evaluation convenience. Expected to be named "id" by default.

More details on what data processing involves and ways to customize it are described in **[this notebook](https://nbviewer.jupyter.org/github/anhaidgroup/deepmatcher/blob/master/examples/data_processing.ipynb)**. 
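The column conventions listed above can be checked before handing a file to `deepmatcher`. The following is a minimal sketch using only the standard library. The helper name `check_pair_schema` and the in-memory sample are hypothetical and not part of `deepmatcher`, and the check that every "left" attribute has a "right" counterpart is an assumption made for this illustration.

```python
import csv
import io

def check_pair_schema(header, require_label=True,
                      left_prefix="left_", right_prefix="right_"):
    """Return a list of problems found in a CSV header (empty if none)."""
    problems = []
    if "id" not in header:
        problems.append("missing 'id' column")
    if require_label and "label" not in header:
        problems.append("missing 'label' column")
    # Attribute names with the side prefix stripped off.
    left = {c[len(left_prefix):] for c in header if c.startswith(left_prefix)}
    right = {c[len(right_prefix):] for c in header if c.startswith(right_prefix)}
    if not left or not right:
        problems.append("no left/right attribute columns")
    # Assumption: each attribute should appear on both sides of the pair.
    for name in sorted(left ^ right):
        problems.append(f"attribute '{name}' is not present on both sides")
    return problems

# Tiny in-memory stand-in for a labeled CSV file (hypothetical sample).
sample = io.StringIO(
    "id,label,left_Song_Name,left_Genre,right_Song_Name,right_Genre\n"
    "287,1,Outversion,Pop,Outversion,Pop\n"
)
header = next(csv.reader(sample))
print(check_pair_schema(header))  # prints []
```

For unlabeled data, the same helper can be called with `require_label=False`.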

### Processing labeled data
To process our train, validation and test CSV files, we call `dm.data.process` in the following code snippet. It loads and processes the CSV files and returns three processed `MatchingDataset` objects, one per file. These dataset objects will later be used for training and evaluation. The basic parameters to `dm.data.process` are as follows:

* **path (required):** The path where all data is stored. This includes train, validation and test. `deepmatcher` may create new files in this directory to store information about these data sets. This allows subsequent `dm.data.process` calls to be much faster.
* **train (required):** File name of training data in `path` directory.
* **validation (required):** File name of validation data in `path` directory.
* **test (optional):** File name of test data in `path` directory.
* **ignore_columns (optional):** Any columns in the CSV files that you may want to ignore for the purposes of training. These should be included here.

Note that the train, validation and test CSVs must all share the same schema, i.e., they should have the same columns. Processing data involves several steps and can take several minutes to complete, especially if this is the first time you are running the `deepmatcher` package.

NOTE: If you are running this in Colab, you may get a message saying 'Memory usage is close to the limit.' You can safely ignore it for now. We are working on reducing the memory footprint.

In [6]:
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')


Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00
INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh


downloading from Google Drive; may take a few minutes


wiki.en.bin: 8.49GB [03:02, 46.7MB/s]
INFO:deepmatcher.data.field:Extracting vectors into /root/.vector_cache

Building vocabulary
0% [#] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

Computing principal components
0% [#] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


#### Peeking at processed data
Let's take a look at what the processed data looks like. To do this, we get the raw `pandas` table corresponding to the processed training dataset object.

In [15]:
train_table = train.get_raw_table()
train_table.head()
train_table.index.name

The processed attribute values have been tokenized and lowercased, so they may not look exactly the same as the input training data. These modifications help the neural network generalize better, i.e., perform better on data it was not trained on.
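To make the effect of this preprocessing concrete, here is a simplified sketch that lowercases and whitespace-tokenizes one attribute value taken from the training sample shown earlier. It is only an illustration: `simple_preprocess` is a hypothetical helper, and the tokenizer that `deepmatcher` actually applies may split text differently.

```python
def simple_preprocess(value):
    """Lowercase a raw attribute value and split it on whitespace."""
    return value.lower().split()

raw = "Stars Come Out ( Tim Mason Remix )"
tokens = simple_preprocess(raw)
print(tokens)
# ['stars', 'come', 'out', '(', 'tim', 'mason', 'remix', ')']
```

After this step, "Remix" and "remix" map to the same token, so the model has fewer distinct surface forms to learn.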

## Step 2. Define neural network model

In this step you tell `deepmatcher` what kind of neural network you would like to use for entity matching. The easiest way to do this is to use one of the several kinds of neural network models that come built in with `deepmatcher`. To use a built-in network, construct a `dm.MatchingModel` as follows:

`model = dm.MatchingModel(attr_summarizer='<TYPE>')`

where `<TYPE>` is one of `sif`, `rnn`, `attention` or `hybrid`. If you are not familiar with what these mean, we strongly recommend taking a look at either **[slides from our talk on deepmatcher](http://bit.do/deepmatcher-talk)** for a high level overview, or **[our paper](http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf)** for a more detailed explanation. Here we briefly describe the intuition behind these four model types:
* **sif:** This model considers the **words** present in each attribute value pair to determine a match or non-match. It does not take word order into account.
* **rnn:** This model considers the **sequences of words** present in each attribute value pair to determine a match or non-match.
* **attention:** This model considers the **alignment of words** present in each attribute value pair to determine a match or non-match. It does not take word order into account.
* **hybrid:** This model considers the **alignment of sequences of words** present in each attribute value pair to determine a match or non-match. This is the default.
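To illustrate what "does not take word order into account" means for the `sif` model type, the sketch below computes a frequency-weighted average of word vectors, using weights of the form `a / (a + p(word))` as in smooth inverse frequency (SIF) sentence embeddings. This is a simplified illustration only: the function `sif_average`, the toy vectors and the word probabilities are all made up, and the built-in model involves more than this averaging step.

```python
import math

def sif_average(tokens, vectors, word_prob, a=1e-3):
    """Weighted average of word vectors with weights a / (a + p(word))."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok in tokens:
        weight = a / (a + word_prob[tok])  # rare words get larger weights
        for k, x in enumerate(vectors[tok]):
            total[k] += weight * x
    return [t / len(tokens) for t in total]

# Toy 2-dimensional vectors and word probabilities (made up for illustration).
vectors = {"stars": [1.0, 0.0], "come": [0.0, 1.0], "out": [1.0, 1.0]}
word_prob = {"stars": 0.001, "come": 0.01, "out": 0.02}

forward = sif_average(["stars", "come", "out"], vectors, word_prob)
backward = sif_average(["out", "come", "stars"], vectors, word_prob)

# Reordering the words leaves the summary unchanged (up to rounding).
same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(forward, backward))
print(same)  # True
```

A sequence-aware model such as `rnn` would, in general, produce different summaries for the two orderings.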

`deepmatcher` is highly customizable and allows you to tune almost every aspect of the neural network model for your application scenario. **[This tutorial](https://nbviewer.jupyter.org/github/anhaidgroup/deepmatcher/blob/master/examples/matching_models.ipynb)** discusses the structure of `MatchingModel`s and how they can be customized.

For this tutorial, let's create a `hybrid` model for entity matching:

In [16]:
model = dm.MatchingModel(attr_summarizer='hybrid')

## Step 3. Train model

Next, we train the defined neural network model using the processed training and validation data. To do so, we call the `run_train` method which takes the following basic parameters:

* **train:** The processed training dataset object (of type `MatchingDataset`).
* **validation:** The processed validation dataset object (of type `MatchingDataset`).
* **epochs:** Number of times to go over the entire `train` data for training the model.
* **batch_size:** Number of labeled examples (tuple pairs) to use for each training step. This value may be increased if you have a lot of training data and would like to speed up training. The optimal value is dataset dependent.
* **best_save_path:** Path to save the best model.
* **pos_neg_ratio:** The ratio of the weight of positive examples (matches) to weight of negative examples (non-matches). This value should be increased if you have fewer matches than non-matches in your data. The optimal value is dataset dependent.

Many other aspects of the training algorithm can be customized. For details on this, please refer to the API documentation for **[run_train]()**.
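To build intuition for `pos_neg_ratio`, the sketch below computes a class-weighted negative log-likelihood in plain Python, where each match counts `pos_neg_ratio` times as much as each non-match. The function `weighted_nll` and the example probabilities are hypothetical, and the way `deepmatcher` normalizes these weights internally is not shown here.

```python
import math

def weighted_nll(probs, labels, pos_neg_ratio=1.0):
    """Mean negative log-likelihood with matches weighted pos_neg_ratio : 1.

    probs  -- predicted probability of "match" for each tuple pair
    labels -- 1 for match, 0 for non-match
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = pos_neg_ratio if y == 1 else 1.0
        p_true = p if y == 1 else 1.0 - p  # probability given to the true class
        total += -w * math.log(p_true)
        weight_sum += w
    return total / weight_sum

probs = [0.2, 0.1, 0.3, 0.2]   # made-up predictions; the match is missed
labels = [1, 0, 0, 0]          # one match among three non-matches

loss_equal = weighted_nll(probs, labels, pos_neg_ratio=1)
loss_weighted = weighted_nll(probs, labels, pos_neg_ratio=3)
print(loss_weighted > loss_equal)  # True
```

With a larger ratio, the missed match contributes more to the loss, so training is pushed harder toward recovering matches when they are rare.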

In [17]:
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 17757810
===>  TRAIN Epoch 1


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 1 || Run Time:    6.2 | Load Time:    1.2 || F1:  44.66 | Prec:  35.66 | Rec:  59.74 || Ex/s:  43.24

===>  EVAL Epoch 1


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    0.9 | Load Time:    0.4 || F1:  60.32 | Prec:  48.72 | Rec:  79.17 || Ex/s:  82.65

* Best F1: tensor(60.3175, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 2 || Run Time:    5.5 | Load Time:    1.1 || F1:  78.41 | Prec:  69.70 | Rec:  89.61 || Ex/s:  48.49

===>  EVAL Epoch 2


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    0.9 | Load Time:    0.4 || F1:  70.00 | Prec:  58.33 | Rec:  87.50 || Ex/s:  85.14

* Best F1: tensor(70., device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 3


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 3 || Run Time:    5.9 | Load Time:    1.1 || F1:  91.46 | Prec:  86.21 | Rec:  97.40 || Ex/s:  45.89

===>  EVAL Epoch 3


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.9 | Load Time:    0.4 || F1:  74.19 | Prec:  60.53 | Rec:  95.83 || Ex/s:  87.02

* Best F1: tensor(74.1936, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 4 || Run Time:    5.4 | Load Time:    1.1 || F1:  96.20 | Prec:  93.83 | Rec:  98.70 || Ex/s:  49.36

===>  EVAL Epoch 4


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.9 | Load Time:    0.4 || F1:  83.64 | Prec:  74.19 | Rec:  95.83 || Ex/s:  87.00

* Best F1: tensor(83.6364, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 5


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 5 || Run Time:    5.5 | Load Time:    1.1 || F1:  97.40 | Prec:  97.40 | Rec:  97.40 || Ex/s:  48.80

===>  EVAL Epoch 5


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.9 | Load Time:    0.4 || F1:  92.00 | Prec:  88.46 | Rec:  95.83 || Ex/s:  85.63

* Best F1: tensor(92., device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 6


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 6 || Run Time:    5.5 | Load Time:    1.1 || F1:  98.70 | Prec:  98.70 | Rec:  98.70 || Ex/s:  48.51

===>  EVAL Epoch 6


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.9 | Load Time:    0.4 || F1:  91.67 | Prec:  91.67 | Rec:  91.67 || Ex/s:  85.83

---------------------

===>  TRAIN Epoch 7


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 7 || Run Time:    5.6 | Load Time:    1.1 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  47.58

===>  EVAL Epoch 7


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.9 | Load Time:    0.4 || F1:  88.00 | Prec:  84.62 | Rec:  91.67 || Ex/s:  83.81

---------------------

===>  TRAIN Epoch 8


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 8 || Run Time:    5.6 | Load Time:    1.1 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  47.68

===>  EVAL Epoch 8


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.9 | Load Time:    0.4 || F1:  90.20 | Prec:  85.19 | Rec:  95.83 || Ex/s:  85.28

---------------------

===>  TRAIN Epoch 9


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 9 || Run Time:    5.5 | Load Time:    1.1 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  48.61

===>  EVAL Epoch 9


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    0.9 | Load Time:    0.4 || F1:  90.20 | Prec:  85.19 | Rec:  95.83 || Ex/s:  87.47

---------------------

===>  TRAIN Epoch 10


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


Finished Epoch 10 || Run Time:    5.4 | Load Time:    1.1 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  49.24

===>  EVAL Epoch 10


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.9 | Load Time:    0.4 || F1:  90.20 | Prec:  85.19 | Rec:  95.83 || Ex/s:  86.77

---------------------

Loading best model...
Training done.


tensor(92., device='cuda:0')
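The value returned at the end, `tensor(92., ...)`, equals the highest validation F1 in the log (epoch 5), and "Saving best model..." appears only in epochs where the validation F1 improved. The selection idea can be sketched as follows, using the validation F1 values printed above. This is an illustration of the idea, not `deepmatcher`'s actual training loop.

```python
# Validation F1 per epoch, copied from the EVAL lines of the log above.
val_f1 = [60.32, 70.00, 74.19, 83.64, 92.00, 91.67, 88.00, 90.20, 90.20, 90.20]

best_f1 = float("-inf")
best_epoch = None
for epoch, f1 in enumerate(val_f1, start=1):
    if f1 > best_f1:  # a strict improvement is when the model would be saved
        best_f1, best_epoch = f1, epoch

print(best_epoch, best_f1)  # 5 92.0
```

Note that the training F1 keeps rising to 100 after epoch 5 while the validation F1 does not, which is why the model is selected on validation data rather than training data.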

## Step 4. Apply model to new data

### Evaluating on test data
Now that we have a trained model for entity matching, we can evaluate its accuracy on test data to estimate the performance of the model on unlabeled data.

In [18]:
# Compute F1 on test set
model.run_eval(test)

===>  EVAL Epoch 5
Finished Epoch 5 || Run Time:    0.6 | Load Time:    0.4 || F1:  85.25 | Prec:  86.67 | Rec:  83.87 || Ex/s: 112.56



tensor(85.2459, device='cuda:0')

In [22]:
try:
    import recordlinkage
except ImportError:
    !pip install -qqq recordlinkage

In [23]:
# Import here as well, in case the package was only just installed by the cell above
import recordlinkage

# Comparison step: fuzzy match on names, exact match on album and genre
compare_cl = recordlinkage.Compare()
compare_cl.string('Song_Name', 'Song_Name', method='jarowinkler', threshold=0.9, label='Song_Name')
compare_cl.string('Artist_Name', 'Artist_Name', method='jarowinkler', threshold=0.9, label='Artist_Name')
compare_cl.exact('Album_Name', 'Album_Name', label='Album_Name')
compare_cl.exact('Genre', 'Genre', label='Genre')

<Compare>
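The comparison step above uses `method='jarowinkler'` with `threshold=0.9` for song and artist names. As a rough illustration of what that similarity measures, here is a pure-Python sketch of the standard Jaro-Winkler definition (prefix scale 0.1, common prefix capped at 4 characters). It is not the implementation `recordlinkage` uses internally, the function names are hypothetical, and mapping scores at or above the threshold to 1 is an assumption about how the threshold is applied.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

def is_similar(s1, s2, threshold=0.9):
    """Binary feature: 1 if the similarity reaches the threshold, else 0."""
    return 1 if jaro_winkler(s1, s2) >= threshold else 0

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(is_similar("Outversion", "Outversion"))      # 1
print(is_similar("Zedd", "Flo Rida"))              # 0
```

Binary features like these are what the later cells sum up per pair for the "Hand-tuned" rule and feed to the classifiers.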

In [34]:
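# Containers for per-dataset results, and paths to the labeled CSV files
true_links = {}
features = {}
# Note: this rebinds the name `validation` used for the deepmatcher dataset above
validation = {}
datasets = {
    'train': 'sample_data/itunes-amazon/train.csv',
    'test': 'sample_data/itunes-amazon/test.csv',
    'validation': 'sample_data/itunes-amazon/validation.csv',
}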

In [35]:
# Some helper functions
def print_results(name, table):
    """Print the confusion matrix and summary metrics under a banner."""
    banner = '*' * (len(name) + 1)
    print(banner)
    print(name)
    print(banner)
    print("Confusion matrix:")
    print(table['confusion_matrix'])
    print("Accuracy: {}".format(table['accuracy']))
    print("Recall: {}".format(table['recall']))
    print("F-score: {}".format(table['f-score']))
    print('\n')

def performance_metrics(true_links, result, set_size):
    """Compute metrics for predicted links; set_size is the number of candidate pairs."""
    metrics = {}
    metrics['confusion_matrix'] = recordlinkage.confusion_matrix(true_links, result, set_size)
    # Use set_size rather than the validation set length so this also works for other sets
    metrics['accuracy'] = recordlinkage.accuracy(true_links, result, set_size)
    metrics['recall'] = recordlinkage.recall(true_links, result)
    metrics['f-score'] = recordlinkage.fscore(true_links, result)
    return metrics

In [36]:
# Candidate classifiers ('Hand-tuned' is a simple rule and needs no model object)
classifiers = {
    'Hand-tuned': None,
    'Logistic regression': recordlinkage.LogisticRegressionClassifier(),
    'Naive Bayes': recordlinkage.NaiveBayesClassifier(),
    'Support vector machine': recordlinkage.SVMClassifier(),
    'K-means': recordlinkage.KMeansClassifier(),
    'ECM': recordlinkage.ECMClassifier(),
}

# Build the true links and comparison features for each dataset
for key in datasets:
    df = pd.read_csv(datasets[key])
    # Columns are: id, label, then the left_* attributes followed by the right_* attributes
    nof_cols = (df.shape[1] - 2) // 2
    # Strip the 'left_' / 'right_' prefixes so both sides share column names
    dfA = df.iloc[:, 2:nof_cols + 2].rename(columns=lambda c: c[len('left_'):])
    dfB = df.iloc[:, nof_cols + 2:].rename(columns=lambda c: c[len('right_'):])

    # Row i of dfA is paired with row i of dfB; labeled matches are the true links
    tuples = [(i, i) for i in range(len(df)) if df.iloc[i]['label'] == 1]
    true_links[key] = pd.MultiIndex.from_tuples(tuples)

    tuples_full = [(i, i) for i in range(len(df))]
    candidate_links = pd.MultiIndex.from_tuples(tuples_full)
    # Final features (used in other methods as well)
    features[key] = compare_cl.compute(candidate_links, dfA, dfB)

In [37]:
for key in classifiers:
    validation[key] = {}
    if key == 'Hand-tuned':
        # Immediate prediction
        result = features['validation'][features['validation'].sum(axis=1) > 2].index
    else:
        # Training 
        if key == 'ECM':
            classifiers[key].fit(features['train']) # somehow ECM cannot ignore redundant argument, opposed to K-Means
        else:
            classifiers[key].fit(features['train'], true_links['train'])
        # Predict the match status for all validation record pairs
        result =  classifiers[key].predict(features['validation'])
        
    # Validate
    validation[key] = performance_metrics(true_links['validation'], result, len(features['validation']))
    
    #Print results
    print_results(key, validation[key])


***********
Hand-tuned
***********
Confusion matrix:
[[ 7 17]
 [ 0 84]]
Accuracy: 0.8425925925925926
Recall: 0.2916666666666667
F-score: 0.45161290322580644


********************
Logistic regression
********************
Confusion matrix:
[[18  6]
 [ 1 83]]
Accuracy: 0.9351851851851852
Recall: 0.75
F-score: 0.8372093023255814


************
Naive Bayes
************
Confusion matrix:
[[18  6]
 [ 1 83]]
Accuracy: 0.9351851851851852
Recall: 0.75
F-score: 0.8372093023255814


***********************
Support vector machine
***********************
Confusion matrix:
[[18  6]
 [ 1 83]]
Accuracy: 0.9351851851851852
Recall: 0.75
F-score: 0.8372093023255814


********
K-means
********
Confusion matrix:
[[ 7 17]
 [ 0 84]]
Accuracy: 0.8425925925925926
Recall: 0.2916666666666667
F-score: 0.45161290322580644


****
ECM
****
Confusion matrix:
[[ 9 15]
 [ 6 78]]
Accuracy: 0.8055555555555556
Recall: 0.375
F-score: 0.4615384615384615




In [38]:
# Test results for best method
f_scores = {key:validation[key]['f-score'] for key in validation}
best_model = max(f_scores, key = f_scores.get) 
if best_model == 'Hand-tuned':
    result = features['test'][features['test'].sum(axis=1) > 2].index
else:
    result = classifiers[best_model].predict(features['test'])
# Use a new name so the deepmatcher `test` dataset is not overwritten
test_metrics = performance_metrics(true_links['test'], result, len(features['test']))
print_results("Selected model ({}) on test set".format(best_model), test_metrics)


*************************************************
Selected model (Logistic regression) on test set
*************************************************
Confusion matrix:
[[25  6]
 [ 1 76]]
Accuracy: 0.9351851851851852
Recall: 0.8064516129032258
F-score: 0.8771929824561403
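The printed metrics can be reproduced from the confusion matrix by hand, which also yields the precision that `print_results` does not show. The sketch below assumes the matrix is laid out as `[[TP, FN], [FP, TN]]`, which is consistent with the accuracy, recall and F-score values printed above. The helper `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Derive metrics from a 2x2 matrix laid out as [[TP, FN], [FP, TN]].

    Assumes at least one predicted match and one true match,
    otherwise precision or recall would divide by zero.
    """
    (tp, fn), (fp, tn) = matrix
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f-score": 2 * precision * recall / (precision + recall),
    }

# Confusion matrix of the selected model on the test set, copied from above.
test_matrix = [[25, 6], [1, 76]]
m = metrics_from_confusion(test_matrix)
for name, value in m.items():
    print("{}: {:.4f}".format(name, value))
```

Keep in mind when comparing numbers that `recordlinkage` reports these metrics as fractions, whereas the `deepmatcher` log reports F1, precision and recall as percentages.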


