<h3> General Mapping Idea: </h3>
  
**Combined Rule-based Matching Strategy:** two sources of information are used: telephone numbers and geographical information (address & geo) <br>
1). parse the geographical information to retrieve the country, state, zip code, etc. of each entity <br>
2). normalize telephone numbers (and ideally decompose them into country code, area code, and national number for better matching outcomes)
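
As a rough illustration of step 2, the sketch below splits a raw telephone string into a country calling code and a national number using only the standard library. The function name `normalize_phone`, the `default_country` parameter, and the tiny `CALLING_CODES` table are assumptions made for this example and are not part of the notebook. Stripping leading zeros as a trunk prefix is a simplification that does not hold for every country, and the area code is not separated here.

```python
import re

# Hypothetical, deliberately tiny table of country calling codes.
# A real pipeline would need a complete table or a dedicated library.
CALLING_CODES = {"US": "1", "DE": "49", "GB": "44"}


def normalize_phone(raw, default_country=None):
    """Return (country_code, national_number), or None if nothing usable is found."""
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    digits = re.sub(r"\D", "", text)  # keep digits only
    if not digits:
        return None

    # International notation: "+49 ..." or "0049 ..."
    if text.startswith("+") or digits.startswith("00"):
        if not text.startswith("+"):
            digits = digits[2:]  # drop the "00" international prefix
        # Try longer calling codes first to avoid ambiguous prefix matches
        for code in sorted(CALLING_CODES.values(), key=len, reverse=True):
            if digits.startswith(code):
                # Simplification: treat leading zeros as a trunk prefix
                return code, digits[len(code):].lstrip("0")
        return None  # calling code not in the (hypothetical) table

    # National notation: fall back to the country known from the address
    code = CALLING_CODES.get(default_country)
    return code, digits.lstrip("0")


print(normalize_phone("+49 621 1234567"))
print(normalize_phone("(212) 555-0100", default_country="US"))
```

The `default_country` argument is where a country derived from the address data could be plugged in, so that numbers written in national notation still receive a country code.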

### 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
from os import listdir
from os.path import isfile, join
import json
import gzip
import shutil 

# import geopy.distance

### 2. Define Functions

In [2]:
# concatenate dataframes
def concatenate_dataframe(source_path):
    files = os.listdir(source_path)
    # initialize a dataframe
    df_final = pd.DataFrame()
    for file in files:
        df = pd.read_json(os.path.join(source_path, file), compression='gzip', lines=True)
        # str.strip() removes a set of characters, not a suffix, so use replace()
        df['table_id'] = file.replace('.json.gz', '')
        df_final = pd.concat([df_final, df], ignore_index=True)
        
    return df_final

<b> Geographical Information </b>

In [3]:
# preprocessing
class EntityPreprocessor:
    """
    Iterate over the cells of a column to flatten all dictionary-typed objects,
    then extract the value of every key and save it in a newly added column named after the key.
    
    Args:
    df: dataframe to be preprocessed 
    cols: columns to be flattened
    keys: a dictionary mapping each column to the keys that have appeared in it
    
    Returns:
    a new dataframe with appended columns
    a dictionary that stores the keys found in the user-specified columns
    """
    
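    def __init__(self, df, cols=None, keys=None):
        self.df = df
        # avoid mutable default arguments, which are shared across instances
        self.cols = cols if cols is not None else []
        self.keys = keys if keys is not None else {}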
    
    # collect keys 
    def collect_keys(self):
        df = self.df
        cols = self.cols
        keys = self.keys
        
        for col in cols:
            if not col in df.columns:
                continue
            # drop records without required data
            tmp = df[~df[col].isna()][col]
            # initialize a list to store keys appearing in the current column
            result_list = []
            for item in tmp:
                try:
                    for key in item.keys():
                        if not key in result_list:
                            result_list.append(key)
                except AttributeError:
                    # skip cells that are not dictionaries
                    continue
            
            # append to the dictionary
            keys[col] = result_list
            
        self.keys = keys
        return keys

    # flatten a dictionary
    @staticmethod
    def _extract_value(iterator, key):
        if isinstance(iterator, dict): 
            return iterator.get(key, None)
        else: 
            return None

    # append new columns to store extracted information
    def column_parser(self, inplace=False): 
        df = self.df
        cols = self.cols
        keys = self.keys
        if not keys:
            raise ValueError("self.keys is empty. Please call collect_keys() first")
        
        for col in cols:
            if not col in df.columns:
                continue
            keys_list = keys[col]
            for key in keys_list:
                # column_parser.key = key
                if key != 'telephone':
                    df[key] = None
                    df[key] = df[col].apply(EntityPreprocessor._extract_value, args=(key,))
                else:
                    df['telephone_'+ col] = df[col].apply(EntityPreprocessor._extract_value, args=(key,))
        if inplace:
            self.df = df
        else: 
            return df

In [4]:
# geo distance calculation
import geopy.distance

def geo_distance(coords_1, coords_2):
    # vincenty() was removed in geopy 2.0; geodesic() is its replacement
    return geopy.distance.geodesic(coords_1, coords_2).km
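
Where geopy is not installed, the distance between two (latitude, longitude) pairs can be approximated with the haversine formula. The sketch below is an assumption-laden alternative, not the notebook's method: the name `haversine_km` is hypothetical, and the formula treats the Earth as a sphere, so results differ slightly from an ellipsoidal distance.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(coords_1, coords_2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, coords_1)
    lat2, lon2 = map(math.radians, coords_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: approximate distance between central Berlin and central Paris
print(haversine_km((52.52, 13.405), (48.8566, 2.3522)))
```

For matching entities within the same city, the spherical error is small compared with the noise in the source coordinates, so this approximation is likely sufficient for a rule-based threshold.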

<b> Telephone Number </b>

In [54]:
# telephone preprocessing


<b> Entity Name </b>

In [5]:
# jaccard similarity calculation
def jaccard_distance(s1, s2):
    tokenized_s1 = set(s1.split(' '))
    tokenized_s2 = set(s2.split(' '))
    overlap = tokenized_s1.intersection(tokenized_s2) 
    union = tokenized_s1.union(tokenized_s2) 
    
    return len(overlap)/len(union)    
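
Note that `jaccard_distance` returns the Jaccard similarity (1.0 for identical token sets); the distance would be one minus that value. The function is also case-sensitive and splits on single spaces only. The sketch below shows one possible normalized variant; the name `jaccard_similarity_normalized` and the choice to return 0.0 for two empty strings are assumptions made for this example.

```python
def jaccard_similarity_normalized(s1, s2):
    """Token-level Jaccard similarity after lowercasing and whitespace splitting."""
    tokens_1 = set(s1.lower().split())
    tokens_2 = set(s2.lower().split())
    union = tokens_1 | tokens_2
    if not union:
        # Both strings are empty; returning 0.0 is a convention chosen here
        return 0.0
    return len(tokens_1 & tokens_2) / len(union)


score = jaccard_similarity_normalized("Hotel Adlon  Berlin", "adlon hotel")
print(score)      # 2 shared tokens out of 3 distinct tokens
print(1 - score)  # corresponding Jaccard distance
```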

### 3. Exploratory Analysis using Top100 Tables

In [17]:
path_parent = os.path.dirname(os.getcwd())
data_path = os.path.join(path_parent, 'src/data')

In [43]:
cd ..

/work-ceph/bizer-tp2021/data_integration_using_deep_learning/src/data/Hotel


In [47]:
cd ../Restaurant_top100

/work-ceph/bizer-tp2021/data_integration_using_deep_learning/src/data/Restaurant/Restaurant_top100


In [48]:
mkdir geo_preprocessed_v2

In [51]:
# preprocessing
source_path = os.path.join(data_path, 'Restaurant/Restaurant_minimum3/final_iteration')
target_path = os.path.join(data_path, 'Restaurant/Restaurant_minimum3/geo_preprocessed_v2')
gzfiles = os.listdir(source_path) 

for file in gzfiles: 
    if not file.endswith('.json.gz'):
        continue
    df_in = pd.read_json(os.path.join(source_path, file), compression='gzip', lines=True)
    local_business = EntityPreprocessor(df_in, cols=['address','geo'])
    keys = local_business.collect_keys()
    df_out = local_business.column_parser()
    df_out.to_json(os.path.join(target_path, file), compression='gzip', orient='records', lines=True)

In [50]:
# preprocessing
source_path = os.path.join(data_path, 'Restaurant/Restaurant_top100/final_iteration')
target_path = os.path.join(data_path, 'Restaurant/Restaurant_top100/geo_preprocessed_v2')
gzfiles = os.listdir(source_path) 

for file in gzfiles: 
    if not file.endswith('.json.gz'):
        continue
    df_in = pd.read_json(os.path.join(source_path, file), compression='gzip', lines=True)
    local_business = EntityPreprocessor(df_in, cols=['address','geo'])
    keys = local_business.collect_keys()
    df_out = local_business.column_parser()
    df_out.to_json(os.path.join(target_path, file), compression='gzip', orient='records', lines=True)

**Geo Info Preprocessing**  <br>
**Results** <br>
**1) Address Keys:**<br> 
'addressregion' (57.7%), 'postalcode' (42.8%), 'streetaddress' (52.7%), 'addresslocality' (66.6%), 'addresscountry' (40.9%), 'postofficeboxnumber' (1.3%), 'citystatezip' (1.3%), 'telephone_address' (5.9%), 'faxnumber' (2.9%), 'sameas' (3.1%)

It seems that *'addressregion', 'addresslocality', 'addresscountry', 'postalcode'* can potentially be used as identifiers. The Python library ***pgeocode*** can extract more information from postal codes, provided the country is known. (Ref: https://www.journaldev.com/49094/find-address-from-zip-code-in-python) <br> 

In addition, most values in 'addressregion' refer to states of different countries, which may, however, be written with different synonyms or abbreviations. The normalized *'addressregion'* values can be used to select entities from a given country. The updated *'addresscountry'* column can then be used when parsing telephone numbers and postal codes. 
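
A minimal sketch of this idea, assuming a hand-built synonym table: region strings are mapped to a canonical region code and a country, and the country is filled in only where it is missing. The table `REGION_SYNONYMS` and the functions `normalize_region` and `fill_country` are hypothetical and only illustrate the approach; a real table would have to be derived from the observed 'addressregion' values.

```python
# Hypothetical synonym table: raw region string -> (canonical region, country).
# Short codes can be ambiguous (e.g. "ca" could also mean Canada), so a real
# table needs care and probably additional evidence such as the postal code.
REGION_SYNONYMS = {
    "ca": ("CA", "US"),
    "calif.": ("CA", "US"),
    "california": ("CA", "US"),
    "ny": ("NY", "US"),
    "new york": ("NY", "US"),
    "bavaria": ("BY", "DE"),
    "bayern": ("BY", "DE"),
}


def normalize_region(raw):
    """Return (region, country) for a known region string, else None."""
    if not isinstance(raw, str):
        return None
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return REGION_SYNONYMS.get(key)


def fill_country(record):
    """Return a copy of the record with a normalized region and, if missing, a country."""
    out = dict(record)
    match = normalize_region(out.get("addressregion"))
    if match:
        region, country = match
        out["addressregion"] = region
        if not out.get("addresscountry"):
            out["addresscountry"] = country
    return out


print(fill_country({"addressregion": "  California ", "addresscountry": None}))
```

An existing 'addresscountry' value is left untouched here, on the assumption that an explicit country is more reliable than one inferred from the region.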

**2) Geo Keys:** <br>
'longitude' (25.1%), 'latitude' (30.4%)

The longitudes and latitudes of entities might be those of the entity headquarters. Since *'address'* often refers to a specific entity branch, it is probably better to replace the longitudes and latitudes with those extracted from postal codes using ***pgeocode***.