# Duplicate Detection Using Deep Learning

Big Data can make duplicate detection harder when we integrate two or more databases. For example, the description of a product on Amazon might contain a lot of text and use a specific vocabulary and grammatical style, while the description of the same product on eBay might also be long but use a completely different vocabulary and style. Deciding whether two records are duplicates then becomes a hard task. Fortunately, advances in deep learning have made it possible to overcome these issues (to some extent) and have produced promising results. This Jupyter notebook contains an example of how to use deep learning to detect duplicates with the deepmatcher library.
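To make the difficulty concrete, here is a minimal sketch of a naive token-overlap (Jaccard) baseline applied to two pairs of product names taken from the dataset used later in this notebook. The `jaccard` helper is an illustrative assumption and is not part of deepmatcher.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Product names copied from the dataset used later in this notebook.
match_pair = ("belkin cush top for computer laptop f8n044slv",
              "belkin cushtop f8n044-slv")                  # labeled 1 (duplicate)
non_match_pair = ("denon blu-ray disc dvd/cd player dvd3800bdci",
                  "denon dvd-2930ci dvd player dvd2930ci")  # labeled 0

match_score = jaccard(*match_pair)
non_match_score = jaccard(*non_match_pair)
print(f"duplicate pair:     {match_score:.3f}")
print(f"non-duplicate pair: {non_match_score:.3f}")
```

In this small example the true duplicate shares fewer tokens than the non-duplicate, which is the kind of case where exact token overlap is not enough.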

In [None]:
# Import the DeepMatcher library, installing it first if it is missing
try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

  Building wheel for deepmatcher (setup.py) ... done
  Building wheel for fasttextmirror (setup.py) ... error
  ERROR: Failed building wheel for fasttextmirror
    Running setup.py install for fasttextmirror ... done


In [None]:
import deepmatcher as dm

In [None]:
# Import other useful dependencies
import sys

In [None]:
# Install the torchtext version needed to make DeepMatcher work
!{sys.executable} -m pip install torchtext==0.2.3



In [None]:
# Import Torch, a Machine Learning library needed by DeepMatcher
import torch

In [None]:
print(torch.__version__)

0.3.1


In [None]:
import deepmatcher as dm

In [None]:
# Upload data

# invoke a file selector (select the files)
from google.colab import files

uploaded = files.upload()

# iterate the uploaded files in order to find their key names
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving testutf.csv to testutf (1).csv
Saving trainutf.csv to trainutf (1).csv
Saving validutf.csv to validutf (1).csv
User uploaded file "testutf.csv" with length 826370 bytes
User uploaded file "trainutf.csv" with length 2333566 bytes
User uploaded file "validutf.csv" with length 815545 bytes


In [None]:
# get an idea about how the data looks
import pandas as pd
pd.read_csv('trainutf (1).csv').head()

Unnamed: 0,id,label,left_name,left_description,left_price,right_name,right_description,right_price
0,1917,0,lg 24 ' lds4821ww semi integrated built in whi...,lg 24 ' lds4821ww semi integrated built in whi...,,lg ldf6920bb fully integrated dishwasher,,
1,1918,0,speck seethru clear hard shell case for macboo...,speck seethru clear hard shell case for macboo...,,speck products seethru case for apple 13 ' mac...,plastic pink,
2,1919,0,denon blu-ray disc dvd/cd player dvd3800bdci,denon blu-ray disc dvd/cd player dvd3800bdci 1...,1999.0,denon dvd-2930ci dvd player dvd2930ci,"dvd + rw , dvd-rw , cd-rw dvd video , dvd audi...",448.0
3,1920,0,panasonic dect 6.0 expandable digital cordless...,panasonic dect 6.0 expandable digital cordless...,,panasonic kx-tg1032s dual handset digital cord...,1 x phone line ( s ) headset jack silver,61.14
4,1921,0,sony silver minidv handycam camcorder dcrhc52,sony silver minidv handycam camcorder dcrhc52 ...,,sony minidv head cleaner dvm12cld,head cleaner,7.95
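The table above shows the layout of the labeled pairs: an `id`, a `label`, and one `left_`/`right_` column per attribute. As a rough sketch, a header can be checked for that layout before processing. `check_pair_schema` is a hypothetical helper written for this illustration, not a deepmatcher function.

```python
def check_pair_schema(columns):
    """Return a list of problems found in a labeled-pairs header (empty if none)."""
    problems = []
    for required in ("id", "label"):
        if required not in columns:
            problems.append(f"missing column: {required}")
    # Every left_<attr> column should have a right_<attr> counterpart and vice versa.
    left = {c[len("left_"):] for c in columns if c.startswith("left_")}
    right = {c[len("right_"):] for c in columns if c.startswith("right_")}
    for attr in sorted(left ^ right):
        problems.append(f"attribute without a counterpart: {attr}")
    return problems

header = ["id", "label", "left_name", "left_description", "left_price",
          "right_name", "right_description", "right_price"]
print(check_pair_schema(header))  # []
```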


In [None]:
# check the current working directory
import os
print(os.getcwd())

In [None]:
# preprocessing (main tasks):
# 1) attribute values are tokenized (e.g. I'm -> I m) and lowercased, and "NaN" values are replaced with blanks - "These modifications help the neural network generalize better, i.e., perform better on data not trained on"
# 2) pretrained word embeddings are loaded (fastText by default)

train, validation, test = dm.data.process(
    path='/content',
    train='trainutf (1).csv',
    validation='validutf (1).csv',
    test='testutf (1).csv')


Reading and processing data from "/content/trainutf (1).csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "/content/validutf (1).csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "/content/testutf (1).csv"
0% [############################# ] 100% | ETA: 00:00:00
Building vocabulary
0% [######] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

Computing principal components
0% [######] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
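The first preprocessing task (tokenizing, lowercasing, and blanking missing values) can be sketched in plain Python as follows. This is a simplified illustration under the assumption of a regex tokenizer. It is not the tokenizer DeepMatcher uses, and `normalize` is a hypothetical name.

```python
import math
import re

def normalize(value):
    """Lowercase and tokenize a raw attribute value; missing values become ''."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    # Split into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", str(value).lower())
    return " ".join(tokens)

print(normalize("I'm a Sony Camcorder"))  # i ' m a sony camcorder
print(repr(normalize(float("nan"))))      # ''
```

Note that this regex also splits hyphenated words, whereas the processed data keeps tokens such as `blu-ray` intact, so real tokenizers behave differently in the details.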


In [None]:
# take a look at how the preprocessed data looks
train_table = train.get_raw_table()
train_table.head()



Unnamed: 0,id,label,left_name,left_description,left_price,right_name,right_description,right_price
0,1917,0,lg 24 ' lds4821ww semi integrated built in whi...,lg 24 ' lds4821ww semi integrated built in whi...,,lg ldf6920bb fully integrated dishwasher,,
1,1918,0,speck seethru clear hard shell case for macboo...,speck seethru clear hard shell case for macboo...,,speck products seethru case for apple 13 ' mac...,plastic pink,
2,1919,0,denon blu-ray disc dvd/cd player dvd3800bdci,denon blu-ray disc dvd/cd player dvd3800bdci 1...,1999.0,denon dvd-2930ci dvd player dvd2930ci,"dvd + rw , dvd-rw , cd-rw dvd video , dvd audi...",448.0
3,1920,0,panasonic dect 6.0 expandable digital cordless...,panasonic dect 6.0 expandable digital cordless...,,panasonic kx-tg1032s dual handset digital cord...,1 x phone line ( s ) headset jack silver,61.14
4,1921,0,sony silver minidv handycam camcorder dcrhc52,sony silver minidv handycam camcorder dcrhc52 ...,,sony minidv head cleaner dvm12cld,head cleaner,7.95


In [None]:
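# define the matching model, which
# 1) summarizes each attribute's sequence of words into a vector (here with the 'hybrid' summarizer)
# 2) compares the summaries of the left and right records to decide whether they match
model = dm.MatchingModel(attr_summarizer='hybrid')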
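The summarize-then-compare idea can be illustrated with a toy example. The sketch below assumes mean pooling as the summarizer and an element-wise absolute difference as the comparator. Both are stand-ins chosen for brevity and are much simpler than the hybrid summarizer, and the embeddings are made up.

```python
def summarize(tokens, embeddings, dim):
    """Summarize a token sequence as the mean of its word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def compare(left_summary, right_summary):
    """Compare two summaries with an element-wise absolute difference."""
    return [abs(l - r) for l, r in zip(left_summary, right_summary)]

# Made-up 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "sony": [1.0, 0.0],
    "camera": [0.0, 1.0],
    "camcorder": [0.2, 0.8],
}

left = summarize(["sony", "camera"], toy_embeddings, dim=2)
right = summarize(["sony", "camcorder"], toy_embeddings, dim=2)
print(compare(left, right))  # small differences suggest similar attribute values
```

In a trained model, the comparison vectors for all attributes would then be passed to a classifier that outputs the match decision.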

In [None]:
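# train the model; the model with the best validation F1 is saved to best_save_path
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)  # weight of positive (match) examples relative to negative ones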

* Number of trainable parameters: 7133105
===>  TRAIN Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:00


Finished Epoch 1 || Run Time:   95.1 | Load Time:   25.8 || F1:   3.30 | Prec:  10.71 | Rec:   1.95 || Ex/s:  44.93

===>  EVAL Epoch 1


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 1 || Run Time:   13.3 | Load Time:    9.4 || F1:   5.38 | Prec:  35.29 | Rec:   2.91 || Ex/s:  84.54

* Best F1: 5.381165919282512
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:02


Finished Epoch 2 || Run Time:   96.4 | Load Time:   26.0 || F1:  27.16 | Prec:  29.73 | Rec:  25.00 || Ex/s:  44.38

===>  EVAL Epoch 2


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 2 || Run Time:   13.0 | Load Time:    9.2 || F1:  24.62 | Prec:  33.61 | Rec:  19.42 || Ex/s:  86.53

* Best F1: 24.615384615384617
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:02


Finished Epoch 3 || Run Time:   96.9 | Load Time:   26.2 || F1:  43.86 | Prec:  39.89 | Rec:  48.70 || Ex/s:  44.16

===>  EVAL Epoch 3


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 3 || Run Time:   13.2 | Load Time:    9.3 || F1:  30.43 | Prec:  34.57 | Rec:  27.18 || Ex/s:  84.86

* Best F1: 30.434782608695652
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:02


Finished Epoch 4 || Run Time:   96.6 | Load Time:   26.2 || F1:  51.04 | Prec:  44.55 | Rec:  59.74 || Ex/s:  44.26

===>  EVAL Epoch 4


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 4 || Run Time:   13.1 | Load Time:    9.2 || F1:  27.35 | Prec:  33.10 | Rec:  23.30 || Ex/s:  86.12

---------------------

===>  TRAIN Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:02


Finished Epoch 5 || Run Time:   96.4 | Load Time:   26.1 || F1:  59.55 | Prec:  52.48 | Rec:  68.83 || Ex/s:  44.37

===>  EVAL Epoch 5


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:23


Finished Epoch 5 || Run Time:   13.6 | Load Time:    9.5 || F1:  25.45 | Prec:  33.87 | Rec:  20.39 || Ex/s:  82.93

---------------------

===>  TRAIN Epoch 6


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:03


Finished Epoch 6 || Run Time:   97.6 | Load Time:   26.5 || F1:  68.39 | Prec:  61.34 | Rec:  77.27 || Ex/s:  43.80

===>  EVAL Epoch 6


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 6 || Run Time:   13.1 | Load Time:    9.2 || F1:  21.79 | Prec:  32.08 | Rec:  16.50 || Ex/s:  85.81

---------------------

===>  TRAIN Epoch 7


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:03


Finished Epoch 7 || Run Time:   97.2 | Load Time:   26.3 || F1:  77.30 | Prec:  69.33 | Rec:  87.34 || Ex/s:  44.00

===>  EVAL Epoch 7


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 7 || Run Time:   13.2 | Load Time:    9.3 || F1:  23.57 | Prec:  34.26 | Rec:  17.96 || Ex/s:  84.85

---------------------

===>  TRAIN Epoch 8


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:03


Finished Epoch 8 || Run Time:   97.1 | Load Time:   26.1 || F1:  80.41 | Prec:  72.70 | Rec:  89.94 || Ex/s:  44.11

===>  EVAL Epoch 8


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 8 || Run Time:   13.1 | Load Time:    9.2 || F1:  24.92 | Prec:  34.78 | Rec:  19.42 || Ex/s:  86.19

---------------------

===>  TRAIN Epoch 9


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:01


Finished Epoch 9 || Run Time:   95.9 | Load Time:   25.9 || F1:  83.16 | Prec:  75.73 | Rec:  92.21 || Ex/s:  44.62

===>  EVAL Epoch 9


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:22


Finished Epoch 9 || Run Time:   13.0 | Load Time:    9.2 || F1:  20.13 | Prec:  32.61 | Rec:  14.56 || Ex/s:  86.51

---------------------

===>  TRAIN Epoch 10


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:04


Finished Epoch 10 || Run Time:   98.4 | Load Time:   26.4 || F1:  86.19 | Prec:  80.17 | Rec:  93.18 || Ex/s:  43.55

===>  EVAL Epoch 10


0% [████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:23


Finished Epoch 10 || Run Time:   13.6 | Load Time:    9.5 || F1:  20.79 | Prec:  39.73 | Rec:  14.08 || Ex/s:  82.94

---------------------

Loading best model...
Training done.


30.434782608695652
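The value returned above is the best validation F1 across epochs. As a sanity check, the sketch below recomputes F1 as the harmonic mean of the precision and recall printed in each EVAL line and picks the best epoch. Because the logged precision and recall are rounded, the recomputed values should only be expected to agree with the log to about two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Validation precision and recall per epoch, copied from the EVAL lines above.
eval_prec = [35.29, 33.61, 34.57, 33.10, 33.87, 32.08, 34.26, 34.78, 32.61, 39.73]
eval_rec = [2.91, 19.42, 27.18, 23.30, 20.39, 16.50, 17.96, 19.42, 14.56, 14.08]

eval_f1 = [f1_score(p, r) for p, r in zip(eval_prec, eval_rec)]
# Epochs are numbered from 1 in the log.
best_epoch = max(range(len(eval_f1)), key=eval_f1.__getitem__) + 1
print(best_epoch, round(eval_f1[best_epoch - 1], 2))  # 3 30.43
```

Training F1 keeps rising through epoch 10 while validation F1 peaks at epoch 3, which is why the model saved at epoch 3 is the one loaded at the end of training.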

In [None]:
# Evaluate (apply the model to the test data to estimate its performance on unseen data)
model.run_eval(test)

===>  EVAL Epoch 4


0% [████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18


Finished Epoch 4 || Run Time:    7.6 | Load Time:   10.4 || F1:  33.66 | Prec:  50.49 | Rec:  25.24 || Ex/s: 105.97



33.65695792880258

In [None]:
# Show what the predictions look like (match_score, the first column, is the predicted match score between 0 and 1)
valid_predictions = model.run_prediction(validation, output_attributes=True)
valid_predictions.head()


===>  PREDICT Epoch 3


0% [████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18


Finished Epoch 3 || Run Time:    7.8 | Load Time:   10.3 || F1:  45.09 | Prec:  49.71 | Rec:  41.26 || Ex/s: 105.88



id,match_score,label,left_name,left_description,left_price,right_name,right_description,right_price
7352,0.11454,0,sony red cyber-shot digital camera dscw150r,sony dscw150 red cyber-shot digital camera dsc...,,canon fs100 digital camera 2699b001,canon fs100 flash memory camcorder,299.95
7353,0.502252,1,belkin cush top for computer laptop f8n044slv,belkin cush top for computer laptop f8n044slv ...,,belkin cushtop f8n044-slv,,
7354,0.298178,0,garmin suction cup mount and 12-volt adapter k...,garmin suction cup mount and 12-volt adapter k...,30.0,garmin adjustable vehicle suction cup 010-1082...,,7.22
7355,0.174121,0,sennheiser rechargeable nickel-metal hydride b...,sennheiser rechargeable nickel-metal hydride b...,20.0,canon nb-4l rechargeable camera battery 9763a001,canon nb-4l lithium ion battery,36.89
7356,0.133399,0,sony cyber-shot black digital camera dsct500b,sony cyber-shot black digital camera dsct500b ...,,sony cyber-shot dsc-w120 digital camera silver...,16:9 2x digital zoom 2.5 ' active matrix tft c...,106.42
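To turn match scores into duplicate decisions, one option is to apply a cutoff. The sketch below uses the five rows shown above and assumes a threshold of 0.5. The threshold and the `predict_labels` helper are illustrative assumptions and can be tuned to trade precision against recall.

```python
# (id, match_score, label) for the five rows shown above.
rows = [
    (7352, 0.11454, 0),
    (7353, 0.502252, 1),
    (7354, 0.298178, 0),
    (7355, 0.174121, 0),
    (7356, 0.133399, 0),
]

def predict_labels(rows, threshold=0.5):
    """Map each id to 1 if its match_score exceeds the threshold, else 0."""
    return {row_id: int(score > threshold) for row_id, score, _ in rows}

predicted = predict_labels(rows)
correct = sum(predicted[row_id] == label for row_id, _, label in rows)
print(predicted)
print(f"{correct} of {len(rows)} rows agree with their label")
```

With this cutoff, row 7353 is the only predicted duplicate, and it only just clears the threshold, so a slightly higher cutoff would miss it.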
