# Foursquare Location Matching

The data presented here comprises over one-and-a-half million place entries for hundreds of thousands of commercial Points-of-Interest (POIs) around the globe. 

My task is to determine which place entries describe the same point-of-interest. Though the data entries may represent or resemble entries for real places, they may also contain artificial information or additional noise.

## Installing libraries and downloading the dataset

In [1]:
%%capture

! pip install kaggle
! pip install numpy
! pip install pandas
! pip install scikit-learn

In [2]:
# Flag to force re-downloading the dataset
RELOAD = False

In [3]:
import os

# import Kaggle API to load dataset
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

# initialize Kaggle API
api = KaggleApi()
api.authenticate()

# download dataset from Kaggle to data folder
data_path = 'data'
api.competition_download_files('foursquare-location-matching', data_path, force=RELOAD, quiet=False)
# name of the downloaded archive; it is set explicitly because os.listdir(data_path)[0]
# returns an arbitrary entry when the data folder contains more than one file
dataset_file_name = 'foursquare-location-matching.zip'

foursquare-location-matching.zip: Skipping, found more recently modified local copy (use --force to force download)


### Data description and loading


#### Training Data

* *train.csv* - The training set, comprising eleven attribute fields for over one million place entries, together with:
    * `id` - A unique identifier for each entry.
    * `point_of_interest` - An identifier for the POI the entry represents. There may be one or many entries describing the same POI. Two entries "match" when they describe a common POI.
* *pairs.csv* - A pregenerated set of pairs of place entries from train.csv designed to improve detection of matches. You may wish to generate additional pairs to improve your model's ability to discriminate POIs.
    * `match` - Whether (`True` or `False`) the pair of entries describes a common POI.
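
The `match` label follows directly from `point_of_interest`: a pair matches when both entries share the same POI identifier. Below is a minimal sketch on made-up toy data, not the competition's own pair-generation procedure. The column names `id_1` and `id_2` are assumptions about how the pairs are laid out, and enumerating all combinations is only feasible for tiny inputs.

```python
from itertools import combinations

import pandas as pd

# Toy entries (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4"],
    "point_of_interest": ["P_a", "P_a", "P_b", "P_a"],
})

# Lookup table: entry id -> POI identifier
poi_of = toy.set_index("id")["point_of_interest"]

# Enumerate every unordered pair of entries (fine for toy data only)
pairs = pd.DataFrame(list(combinations(toy["id"], 2)), columns=["id_1", "id_2"])

# A pair matches when both entries point to the same POI
pairs["match"] = pairs["id_1"].map(poi_of) == pairs["id_2"].map(poi_of)
print(pairs)
```

On the real training set the number of possible pairs is far too large to enumerate, so any extra pairs would have to come from a narrower candidate search.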

#### Example Test Data

To help you author submission code, we include a few example instances selected from the test set. When you submit your notebook for scoring, this example data will be replaced by the actual test data. The actual test set has approximately 600,000 place entries with POIs that are distinct from the POIs in the training set.

* *test.csv* - A set of place entries with their recorded attribute fields, similar to the training set.
* *sample_submission.csv* - A sample submission file in the correct format.
    * `id` - The unique identifier for a place entry, one for each entry in the test set.
    * `matches` - A space delimited list of IDs for entries in the test set matching the given ID. Place entries always self-match.
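
The submission format described above can be sketched as follows, assuming each test entry has already been assigned a predicted group label. The `group` column and the toy IDs are hypothetical and serve only to illustrate the layout.

```python
import pandas as pd

# Toy predictions: entries with the same `group` are assumed to be the same POI
toy = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3"],
    "group": [0, 0, 1],
})

# Space-delimited list of all IDs in each group
members = toy.groupby("group")["id"].agg(" ".join)

# Every entry gets the full member list of its own group,
# so it always appears in its own `matches` (the self-match)
submission = pd.DataFrame({
    "id": toy["id"],
    "matches": toy["group"].map(members),
})
print(submission)
```

Because each entry belongs to its own group, the self-match requirement is satisfied without any special handling.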

In [4]:
# zipfile lets us read the CSV files directly from the archive downloaded from Kaggle
from zipfile import ZipFile
# import pandas for EDA
import pandas as pd

# open the archive once and read every CSV file from it
with ZipFile(os.path.join(data_path, dataset_file_name)) as archive:
    # train set (train.csv): the DataFrame `df` is used for analysis
    df = pd.read_csv(archive.open('train.csv'))
    # pregenerated pairs of place entries (pairs.csv)
    df_pairs = pd.read_csv(archive.open('pairs.csv'))
    # test set (test.csv): used only to generate the final predictions for submission
    df_validation = pd.read_csv(archive.open('test.csv'))
    # sample submission: a format example only, it contains no correct predictions
    df_subm_example = pd.read_csv(archive.open('sample_submission.csv'))

In [5]:
# Check that all DataFrames are loaded and have the expected shapes
print(f'Shape of df: {df.shape}')
print(f'Shape of df_pairs: {df_pairs.shape}')
print(f'Shape of df_validation: {df_validation.shape}')
print(f'Shape of df_subm_example: {df_subm_example.shape}')

Shape of df: (1138812, 13)
Shape of df_pairs: (578907, 25)
Shape of df_validation: (5, 12)
Shape of df_subm_example: (5, 2)


## Exploratory data analysis

First, let's take a look at `df` and analyse its structure and data.

In [6]:
df.head()

| | id | name | latitude | longitude | address | city | state | zip | country | url | phone | categories | point_of_interest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E_000001272c6c5d | Café Stad Oudenaarde | 50.859975 | 3.634196 | Abdijstraat | Nederename | Oost-Vlaanderen | 9700.0 | BE | | | Bars | P_677e840bb6fc7e |
| 1 | E_000002eae2a589 | Carioca Manero | -22.907225 | -43.178244 | | | | | BR | | | Brazilian Restaurants | P_d82910d8382a83 |
| 2 | E_000007f24ebc95 | ร้านตัดผมการาเกด | 13.780813 | 100.4849 | | | | | TH | | | Salons / Barbershops | P_b1066599e78477 |
| 3 | E_000008a8ba4f48 | Turkcell | 37.84451 | 27.844202 | Adnan Menderes Bulvarı | | | | TR | | | Mobile Phone Shops | P_b2ed86905a4cd3 |
| 4 | E_00001d92066153 | Restaurante Casa Cofiño | 43.338196 | -4.326821 | | Caviedes | Cantabria | | ES | | | Spanish Restaurants | P_809a884d4407fb |


In [7]:
print(f'Number of records in df: {df.shape[0]}')
print(f'Number of columns in df: {df.shape[1]}')

Number of records in df: 1138812
Number of columns in df: 13


In [8]:
print(f'Names of columns in df: {list(df.columns)}')

Names of columns in df: ['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state', 'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest']


DataFrame `df` contains the POI entries.
Let's go through the columns and describe them.

First, let's create a helper function, `col_describe`, to reuse for the different columns.

In [9]:
def col_describe(column_name, df=df):
    """Print the dtype, NaN count and share of unique values of one column of `df`."""
    t = df[column_name].dtype
    print(f"Type of `{column_name}` column in `df` is: {t}")
    if t == object:
        # object columns in this dataset hold strings (in general, object can hold any Python object)
        print("Object in pandas means string")
    print(f"Number of NaNs in `{column_name}` column: {df[column_name].isna().sum()}")
    n = len(df[column_name])
    # nunique() ignores NaNs, while n counts every row, NaNs included
    nu = df[column_name].nunique()
    print(f"Total amount of records (column `{column_name}`) is {n} and number of unique values is {nu}")
    print(f"{nu / n * 100}% of the values in `{column_name}` column are unique")
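
The same per-column statistics can also be gathered into a single table for all columns at once. This is only a sketch on toy data: `summarize_columns` is a hypothetical helper, not part of the notebook, and it follows the same convention as `col_describe` of dividing the unique count (NaNs excluded) by the total row count.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, NaN count and share of unique values for every column."""
    n_rows = len(frame)
    n_unique = frame.nunique()  # NaNs are not counted as a value
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": n_unique,
        "pct_unique": n_unique / n_rows * 100,
    })


# Toy frame (hypothetical values) with one missing city
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "city": ["Paris", None, "Paris", "Rome"],
})
summary = summarize_columns(toy)
print(summary)
```

Each row of `summary` describes one column of the input, which makes columns with many missing values easy to spot side by side.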


Column `id` is a unique identifier of the entry

In [10]:
col_describe('id')

Type of `id` column in `df` is: object
Object in pandas means string
Number of NaNs in `id` column: 0
Total amount of records (column `id`) is 1138812 and number of unique values is 1138812
100.0% of the values in `id` column are unique


Column `name` is the name of the POI entry

In [18]:
col_describe('name')

Type of `name` column in `df` is: object
Object in pandas means string
Number of NaNs in `name` column: 1
Total amount of records (column `name`) is 1138812 and number of unique values is 842086
73.9442506752651% of the values in `name` column are unique


Columns `latitude` and `longitude` are the geographical coordinates of the reported location

In [20]:
col_describe('latitude')
print('\n')
col_describe('longitude')

Type of `latitude` column in `df` is: float64
Number of NaNs in `latitude` column: 0
Total amount of records (column `latitude`) is 1138812 and number of unique values is 1121701
98.49746929256102% of the values in `latitude` column are unique


Type of `longitude` column in `df` is: float64
Number of NaNs in `longitude` column: 0
Total amount of records (column `longitude`) is 1138812 and number of unique values is 1080273
94.85964320713164% of the values in `longitude` column are unique


Columns `address`, `city`, `state`, `zip` and `country` describe the address of the reported location

In [21]:
col_describe('address')
print('\n')
col_describe('city')
print('\n')
col_describe('state')
print('\n')
col_describe('zip')
print('\n')
col_describe('country')

Type of `address` column in `df` is: object
Object in pandas means string
Number of NaNs in `address` column: 396621
Total amount of records (column `address`) is 1138812 and number of unique values is 558154
49.01195280696024% of the values in `address` column are unique


Type of `city` column in `df` is: object
Object in pandas means string
Number of NaNs in `city` column: 299189
Total amount of records (column `city`) is 1138812 and number of unique values is 68105
5.980354966403585% of the values in `city` column are unique


Type of `state` column in `df` is: object
Object in pandas means string
Number of NaNs in `state` column: 420586
Total amount of records (column `state`) is 1138812 and number of unique values is 17596
1.5451189485182804% of the values in `state` column are unique


Type of `zip` column in `df` is: object
Object in pandas means string
Number of NaNs in `zip` column: 595426
Total amount of records (column `zip`) is 1138812 and number of unique values is 93329


Columns `url` and `phone` hold the website and phone number of the POI

In [22]:
col_describe('url')
print('\n')
col_describe('phone')

Type of `url` column in `df` is: object
Object in pandas means string
Number of NaNs in `url` column: 871088
Total amount of records (column `url`) is 1138812 and number of unique values is 171222
15.035141884700899% of the values in `url` column are unique


Type of `phone` column in `df` is: object
Object in pandas means string
Number of NaNs in `phone` column: 795957
Total amount of records (column `phone`) is 1138812 and number of unique values is 293454
25.768432366360734% of the values in `phone` column are unique


Column `point_of_interest`: an identifier for the POI the entry represents. There may be one or many entries describing the same POI. Two entries "match" when they describe a common POI.

In [23]:
col_describe('point_of_interest')

Type of `point_of_interest` column in `df` is: object
Object in pandas means string
Number of NaNs in `point_of_interest` column: 0
Total amount of records (column `point_of_interest`) is 1138812 and number of unique values is 739972
64.97753799573591% of the values in `point_of_interest` column are unique
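
With 1138812 entries spread over 739972 distinct POIs, each POI is described by about 1.54 entries on average. The distribution behind that average can be inspected by counting entries per POI, as in the sketch below. The toy data is made up; on the real data the same calls would be applied to `df['point_of_interest']`.

```python
import pandas as pd

# Toy entries (hypothetical values): P_a has three entries, P_b and P_c one each
toy = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4", "E_5"],
    "point_of_interest": ["P_a", "P_a", "P_a", "P_b", "P_c"],
})

# Number of entries describing each POI
sizes = toy["point_of_interest"].value_counts()

# POIs described by a single entry have no match other than the self-match
n_single = int((sizes == 1).sum())

# A POI with k entries contributes k * (k - 1) / 2 matching pairs
n_pairs = int((sizes * (sizes - 1) // 2).sum())

print(f"POIs with a single entry: {n_single}")
print(f"Matching pairs (excluding self-matches): {n_pairs}")
```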
