<h1><center> Foursquare Location Matching </center></h1>
<h3><center> Pairs Data Generation </center></h3>

This notebook builds candidate pairs of places based on geographic proximity and labels every true matching pair. Generating these pairs is memory-intensive, and Kaggle's RAM limit can be exceeded while the algorithms run. To work around this, choose a sample size that fits your machine's configuration.
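
One possible way to pick such a sample is to draw whole points of interest rather than individual rows, so that matching records stay together. The sketch below is only an illustration under that assumption: `sample_by_poi`, `frac`, and `seed` are hypothetical names that do not appear in this notebook, and the toy frame stands in for `train.csv`.

```python
import numpy as np
import pandas as pd

def sample_by_poi(df, frac=0.1, seed=0):
    """Hypothetical helper: keep all rows of a random fraction of POIs."""
    rng = np.random.default_rng(seed)
    pois = df['point_of_interest'].dropna().unique()
    n_keep = max(1, int(len(pois) * frac))  # always keep at least one POI
    keep = rng.choice(pois, size=n_keep, replace=False)
    return df[df['point_of_interest'].isin(keep)].reset_index(drop=True)

# Toy stand-in for train.csv: three POIs with two records each
toy = pd.DataFrame({
    'id': ['E_1', 'E_2', 'E_3', 'E_4', 'E_5', 'E_6'],
    'point_of_interest': ['P_a', 'P_a', 'P_b', 'P_b', 'P_c', 'P_c'],
})
toy_sample = sample_by_poi(toy, frac=0.34, seed=0)
```

Sampling by POI keeps every true pair of the selected places intact, whereas sampling rows directly would split many of them.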


Competition: [Foursquare - Location Matching](https://www.kaggle.com/competitions/foursquare-location-matching)

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer

from geopy.geocoders import Nominatim

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer

import Levenshtein as lev
import math
from collections import Counter

from pickle import dump, load
import time
from sklearn.neighbors import BallTree

import itertools
from tqdm.auto import tqdm
tqdm.pandas()
import gc

from fuzzywuzzy import fuzz
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
start_time = time.time()

In [2]:
train_data = pd.read_csv("/kaggle/input/foursquare-location-matching/train.csv")
train_merged = pd.merge(train_data, train_data, on='point_of_interest', suffixes=('_1', '_2'), how='inner')
train_pairs_true = train_merged[train_merged['id_1'] != train_merged['id_2']]
train_pairs_true = train_pairs_true.drop(['point_of_interest'], axis=1)
train_pairs_true['match'] = True
train_pairs_true.shape

(1901006, 25)
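
The self-merge in the cell above produces every ordered pair of distinct rows that share a `point_of_interest`, so each match appears twice (once as A/B and once as B/A). A toy sketch of that behavior, using made-up ids; the `unordered_pairs` filter is shown only for comparison and is not something this notebook applies:

```python
import pandas as pd

# Toy data: P_a has three records, P_b has one
toy = pd.DataFrame({
    'id': ['E_1', 'E_2', 'E_3', 'E_4'],
    'point_of_interest': ['P_a', 'P_a', 'P_a', 'P_b'],
})

merged = pd.merge(toy, toy, on='point_of_interest',
                  suffixes=('_1', '_2'), how='inner')

# Same filter as the notebook: drop self-pairs, keep both orderings
ordered_pairs = merged[merged['id_1'] != merged['id_2']]

# Alternative (an assumption, not used here): keep each pair once
unordered_pairs = merged[merged['id_1'] < merged['id_2']]
```

Keeping both orderings doubles the number of positive rows; whether that is desirable depends on whether the downstream features are symmetric in the two records.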

In [3]:
train_data_copy = train_data.copy()
train_data_copy.index = range(1,len(train_data_copy)+1)
train_data_copy = train_data_copy.add_suffix('_2')
non_pairs = pd.concat([train_data.add_suffix('_1'),train_data_copy], axis=1).dropna(subset=['point_of_interest_1', 'point_of_interest_2'])
non_pairs = non_pairs[non_pairs['point_of_interest_1'] != non_pairs['point_of_interest_2']]
non_pairs = non_pairs.drop(['point_of_interest_1', 'point_of_interest_2'], axis=1)
non_pairs['match'] = False
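
The cell above builds negative pairs by shifting a copy of the data down by one index position, so each row is lined up with the row that precedes it in the file. A minimal sketch of the same idea on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['E_1', 'E_2', 'E_3'],
    'point_of_interest': ['P_a', 'P_a', 'P_b'],
})

# Shift the copy's index by one so row i lines up with row i-1
shifted = toy.copy()
shifted.index = range(1, len(shifted) + 1)

paired = pd.concat([toy.add_suffix('_1'), shifted.add_suffix('_2')], axis=1)
# The first and last index positions have no partner, so they contain NaN
paired = paired.dropna(subset=['point_of_interest_1', 'point_of_interest_2'])

# Only pairs from different POIs are usable as negatives
negatives = paired[paired['point_of_interest_1'] != paired['point_of_interest_2']]
```

Here `E_2` is paired with `E_1` (same POI, discarded) and `E_3` is paired with `E_2` (different POIs, kept), so one negative pair remains.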


In [4]:
def create_match_loc(test, neighbour = 3):
    # Pair every row with its (neighbour - 1) nearest rows by great-circle distance.
    # neighbour counts the row itself.
    if len(test) < neighbour:
        neighbour = len(test)
    # BallTree's haversine metric expects [latitude, longitude] in radians
    coords = np.deg2rad(test[['latitude', 'longitude']].values)
    tree = BallTree(coords, metric='haversine')
    dist, ind = tree.query(coords, k=neighbour)
    # Drop the first column, assumed to be the point itself
    # (rows with identical coordinates can break this assumption).
    # Keeping the arrays 2-D (no squeeze) also makes neighbour = 2 work.
    dist = dist[:,1:]
    ind = ind[:,1:]
    test_col = test.columns.tolist()
    # 'col' instead of 'str' to avoid shadowing the built-in
    combine_col = [col + '_1' for col in tqdm(test_col)] + [col + '_2' for col in tqdm(test_col)]
    df_combine = pd.DataFrame(np.concatenate([
                np.repeat(np.array(test), neighbour-1, axis = 0),
                test.iloc[ind.ravel(),:]
               ], axis=1))
    df_combine.columns = combine_col
    return df_combine
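
To make the nearest-neighbour step easier to follow, here is a small standalone sketch of a haversine `BallTree` query on three made-up coordinates. The Earth radius constant is an approximation added for illustration; the notebook itself never converts the distances to kilometres.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)

# Made-up [latitude, longitude] points in degrees: two close together, one far away
coords_deg = np.array([
    [47.4663, 19.0059],
    [47.4664, 19.0066],
    [41.0004, 29.0617],
])
coords_rad = np.deg2rad(coords_deg)  # haversine expects radians

tree = BallTree(coords_rad, metric='haversine')
# k=2: each point itself plus its single nearest other point
dist_rad, ind = tree.query(coords_rad, k=2)

nearest_idx = ind[:, 1]                        # index of the nearest other point
nearest_km = dist_rad[:, 1] * EARTH_RADIUS_KM  # radians on the unit sphere to km
```

The first two points are each other's nearest neighbour, well under a kilometre apart, while the third point is paired with one of them only because nothing closer exists. This is why proximity pairs still need the `point_of_interest` comparison to be labelled.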

In [5]:
train_pairs_close = create_match_loc(train_data, neighbour = 3)
train_pairs_close_True = train_pairs_close[train_pairs_close['point_of_interest_1'] == train_pairs_close['point_of_interest_2']]
train_pairs_close_False = train_pairs_close[train_pairs_close['point_of_interest_1'] != train_pairs_close['point_of_interest_2']]

train_pairs_close_True = train_pairs_close_True.drop(['point_of_interest_1','point_of_interest_2'], axis=1)
train_pairs_close_False = train_pairs_close_False.drop(['point_of_interest_1','point_of_interest_2'], axis=1)

train_pairs_close_True['match'] = True
train_pairs_close_False['match'] = False

  0%|          | 0/13 [00:00<?, ?it/s]

  0%|          | 0/13 [00:00<?, ?it/s]

In [6]:
train_pairs = pd.concat([train_pairs_close_False,non_pairs,train_pairs_true],axis = 0)
train_pairs.shape

(4871594, 25)

In [7]:
train_pairs['match'].value_counts()

False    2970588
True     1901006
Name: match, dtype: int64

In [8]:
train_pairs = train_pairs.sample(frac=1).reset_index(drop=True) # shuffle
# read pairs.csv only to recover its column dtypes, then cast train_pairs to match
pairs_sample = pd.read_csv('../input/foursquare-location-matching/pairs.csv').iloc[0:2,:]
dtype_dict = pairs_sample.dtypes.apply(lambda x: x.name).to_dict()
del pairs_sample
gc.collect()
train_pairs = train_pairs.astype(dtype_dict)

In [9]:
train_pairs.head()

|  | id_1 | name_1 | latitude_1 | longitude_1 | address_1 | city_1 | state_1 | zip_1 | country_1 | url_1 | ... | longitude_2 | address_2 | city_2 | state_2 | zip_2 | country_2 | url_2 | phone_2 | categories_2 | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E_d9142c07a2f2bd | SPAR | 47.466338 | 19.005924 | Rétköz utca 7. | Budapest |  | 1118.0 | HU | https://www.spar.hu/hu_HU/uzletek.html | ... | 19.006568 | Rétköz u. 7. | Budapest |  | 1118.0 | HU |  | 3612462731.0 | Gas Stations | False |
| 1 | E_46c0731f93f5ad | Denizin Dibinde | 41.000384 | 29.061706 |  |  |  |  | TR |  | ... | 29.01138 | Bogaz |  |  |  | TR |  |  | Boats or Ferries | True |
| 2 | E_0bd9a22c7dbd34 | церковь Успенская С Пароменья на Завеличье | 57.819119 | 28.324169 |  |  |  |  | RU |  | ... | 28.323856 |  | Псков |  |  | RU |  |  | Hotels | False |
| 3 | E_cb68ef0969a1e0 | Starbucks - Magic Kingdom | 28.417825 | -81.581066 |  | Bay Lake | FL |  | US |  | ... | -81.581173 |  | Bay Lake | FL |  | US |  |  | Coffee Shops | True |
| 4 | E_5657570ea56031 | Angkringan Dewangga | -7.70467 | 109.026 |  | Cilacap |  |  | ID |  | ... | 109.026301 | Jalan Kalimantan | Cilacap | jawa tengah |  | ID |  |  | Markets, Farmers Markets | False |


In [10]:
train_pairs.shape

(4871594, 25)

In [11]:
print("--- %s seconds ---" % (time.time() - start_time))

--- 759.1215319633484 seconds ---


In [12]:
train_pairs.to_pickle('./train_pairs_raw.pkl')

Reference: https://www.kaggle.com/code/ficklemaverick/generating-new-pairs-of-match-and-mismatch