**POI Clustering**

*Step 1: Get the category corresponding to each POI*

- **1.1 Split raw POI data**
- 1.2 Add CAT_NUMBER to the micode sheet (merged on TRADE_GROUP)
- 1.3 Add CAT_NUMBER to the POI data (merged on MICODE)
- 1.4 Retain POI data with CAT_NUMBER between 101 and 115
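
Steps 1.2 to 1.4 could be sketched as below. This is only an illustration under stated assumptions: the three toy tables, the TRADE_GROUP labels, and the CAT_NUMBER assigned to each group are invented for the example, and the real micode sheet may be structured differently. Only the column names (TRADE_GROUP, MICODE, CAT_NUMBER) and the 101 to 115 range come from the list above.

```python
import pandas as pd

# Hypothetical lookup: one CAT_NUMBER per TRADE_GROUP (values invented for illustration)
categories = pd.DataFrame({
    "TRADE_GROUP": ["RETAIL", "TRANSPORT", "OTHER"],
    "CAT_NUMBER": [101, 115, 120],
})

# Hypothetical micode sheet: maps each MICODE to a TRADE_GROUP
micodes = pd.DataFrame({
    "MICODE": ["10015941", "10320302", "19999999"],
    "TRADE_GROUP": ["RETAIL", "TRANSPORT", "OTHER"],
})

# Toy POI data; MICODE is kept as a string so the merge keys match exactly
pois = pd.DataFrame({
    "NAME": ["GOLF-FUNNY OY", "LEPOLA I", "UNKNOWN PLACE"],
    "MICODE": ["10015941", "10320302", "19999999"],
})

# 1.2 Add CAT_NUMBER to the micode sheet (merged on TRADE_GROUP)
micodes_cat = micodes.merge(categories, on="TRADE_GROUP", how="left")

# 1.3 Add CAT_NUMBER to the POI data (merged on MICODE)
pois_cat = pois.merge(micodes_cat[["MICODE", "CAT_NUMBER"]], on="MICODE", how="left")

# 1.4 Retain POIs with CAT_NUMBER between 101 and 115 (both bounds included)
pois_kept = pois_cat[pois_cat["CAT_NUMBER"].between(101, 115)].reset_index(drop=True)

print(pois_kept)
```

A left merge keeps every POI even when its MICODE has no category, so unmatched rows end up with a missing CAT_NUMBER and are dropped by the range filter in step 1.4.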

# 1 Import packages

In [7]:
import pandas as pd

# 2 Split the raw POIs CSV file

In [1]:
# Define input file path
input_file = '../../../../data/Locomizer/pois/2023-05/POIs_for_JCD_Finland_(micodes).csv'

In [13]:
# Open the file and read all lines
with open(input_file, 'r', encoding='utf-8') as file:
    lines = file.readlines()

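# The first line holds the column names, separated by tabs.
# strip('"') removes only the opening quote here, because the line still ends with '\n'
columns = lines[0].strip('"').split('\t')
columns[-1] = columns[-1].rstrip('",\n')  # Remove the trailing quote, commas, and newline from the last column name

# The remaining lines are data rows; the last field of each row is cleaned in the next cell
data = [line.strip('"').split('\t') for line in lines[1:]]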

# Create a DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first few rows of processed data
df.head()

Unnamed: 0,LATITUDE,LONGITUDE,NAME,BRANDNAME,FRANCHISE_NAME,TRADE_NAME,ISO3,AREANAME3,AREANAME2,ADD_NUMBER,...,SIC1,SIC2,SIC8,SIC8_DESCRIPTION,MICODE,TRADE_DIVISION,GROUP_NAME,MAIN_CLASS,SUB_CLASS,GLOBAL_ULTIMATE_BUSINESS_NAME
0,63.74856,28.67391,"PUUKARI TH"","" P",,,,FIN,PUUKARI,POHJOIS-KARJALA,1.0,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,""",,,,,,,,,,,\n"
1,64.58516,24.69568,TIKKALANTIE P,,,,FIN,RAAHE,POHJOIS-POHJANMAA,,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,""",,,,,,,,,,,,\n"
2,62.8689,27.63943,GOLF-FUNNY OY,,,,FIN,KUOPIO,POHJOIS-SAVO,,...,5941.0,,59410000.0,SPORTING GOODS AND BICYCLE SHOPS,10015941,DIVISION G. - RETAIL TRADE,MISCELLANEOUS RETAIL,MISCELLANEOUS SHOPPING GOODS STORES,SPORTING GOODS AND BICYCLE SHOPS,""",,,,,,,,,,,,\n"
3,63.94491,24.93294,LEPOLA I,,,,FIN,NIVALA,POHJOIS-POHJANMAA,119.0,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,""",,,,,,,,,,,,\n"
4,61.32062,22.77247,KIIKANOJANTIE P,,,,FIN,SASTAMALA,PIRKANMAA,,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,""",,,,,,,,,,,,\n"


In [14]:
# Remove trailing quotes, commas, and newlines from the values in the last column
df[df.columns[-1]] = df[df.columns[-1]].str.rstrip('",\n')

# Display the first few rows of processed data
df.head()

Unnamed: 0,LATITUDE,LONGITUDE,NAME,BRANDNAME,FRANCHISE_NAME,TRADE_NAME,ISO3,AREANAME3,AREANAME2,ADD_NUMBER,...,SIC1,SIC2,SIC8,SIC8_DESCRIPTION,MICODE,TRADE_DIVISION,GROUP_NAME,MAIN_CLASS,SUB_CLASS,GLOBAL_ULTIMATE_BUSINESS_NAME
0,63.74856,28.67391,"PUUKARI TH"","" P",,,,FIN,PUUKARI,POHJOIS-KARJALA,1.0,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,
1,64.58516,24.69568,TIKKALANTIE P,,,,FIN,RAAHE,POHJOIS-POHJANMAA,,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,
2,62.8689,27.63943,GOLF-FUNNY OY,,,,FIN,KUOPIO,POHJOIS-SAVO,,...,5941.0,,59410000.0,SPORTING GOODS AND BICYCLE SHOPS,10015941,DIVISION G. - RETAIL TRADE,MISCELLANEOUS RETAIL,MISCELLANEOUS SHOPPING GOODS STORES,SPORTING GOODS AND BICYCLE SHOPS,
3,63.94491,24.93294,LEPOLA I,,,,FIN,NIVALA,POHJOIS-POHJANMAA,119.0,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,
4,61.32062,22.77247,KIIKANOJANTIE P,,,,FIN,SASTAMALA,PIRKANMAA,,...,,,,BUS STOP,10320302,DIVISION E. - TRANSPORTATION AND PUBLIC UTILITIES,LOCAL AND SUBURBAN TRANSIT AND INTERURBAN HIGH...,PUBLIC TRANSPORT STOP,BUS STOP,
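
The two cleaning steps above (splitting on tabs, then stripping the last field) could also be done in a single pass per line. The sketch below is a hypothetical helper, not the method used in this notebook; `split_wrapped_line` and the three toy lines are invented, and the line layout (a quoted, tab-delimited record followed by padding commas) is assumed from the output shown above.

```python
import pandas as pd


def split_wrapped_line(line):
    """Split one tab-delimited record that is wrapped in quotes and padded with commas."""
    # Drop the trailing newline, padding commas, and closing quote, then the opening quote
    cleaned = line.rstrip('",\r\n').lstrip('"')
    return cleaned.split('\t')


# Toy lines imitating the assumed raw layout
raw_lines = [
    '"NAME\tMICODE\tSUB_CLASS",,,\n',
    '"LEPOLA I\t10320302\tBUS STOP",,,\n',
    '"TIKKALANTIE P\t10320302\t",,,\n',
]

toy_columns = split_wrapped_line(raw_lines[0])
toy_rows = [split_wrapped_line(line) for line in raw_lines[1:]]
toy_df = pd.DataFrame(toy_rows, columns=toy_columns)

print(toy_df)
```

Like the `rstrip` calls above, this would also remove a quote or comma that legitimately ends the last field, so it is only safe when that field never ends with those characters.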


In [15]:
# Display the data type of each column
print(df.dtypes)

LATITUDE                         object
LONGITUDE                        object
NAME                             object
BRANDNAME                        object
FRANCHISE_NAME                   object
TRADE_NAME                       object
ISO3                             object
AREANAME3                        object
AREANAME2                        object
ADD_NUMBER                       object
STREETNAME                       object
POSTCODE                         object
FORMATTED_ADDRESS                object
MAIN_ADDRESS_LINE                object
ADDRESS_LAST_LINE                object
GEO_CONFIDENCE_CODE              object
HTTP                             object
BUSINESS_LINE                    object
SIC1                             object
SIC2                             object
SIC8                             object
SIC8_DESCRIPTION                 object
MICODE                           object
TRADE_DIVISION                   object
GROUP_NAME                       object
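
Every column is reported as `object` because the file was split by hand, so each value is still a string. If numeric coordinates are needed later, a conversion along the following lines might help. This is a minimal sketch on an invented toy frame; whether LATITUDE and LONGITUDE should be converted at this stage is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy frame with string values, as produced by splitting lines manually
toy = pd.DataFrame({
    "LATITUDE": ["63.74856", "64.58516", ""],
    "LONGITUDE": ["28.67391", "24.69568", "not a number"],
    "MICODE": ["10320302", "10320302", "10015941"],
})

# Convert coordinates to floats; values that cannot be parsed become NaN
for col in ["LATITUDE", "LONGITUDE"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy)
```

MICODE is left as a string on purpose, since it is an identifier used as a merge key rather than a quantity.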


In [16]:
# View a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229738 entries, 0 to 229737
Data columns (total 28 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   LATITUDE                       229738 non-null  object
 1   LONGITUDE                      229738 non-null  object
 2   NAME                           229738 non-null  object
 3   BRANDNAME                      229738 non-null  object
 4   FRANCHISE_NAME                 229738 non-null  object
 5   TRADE_NAME                     229738 non-null  object
 6   ISO3                           229738 non-null  object
 7   AREANAME3                      229738 non-null  object
 8   AREANAME2                      229738 non-null  object
 9   ADD_NUMBER                     229738 non-null  object
 10  STREETNAME                     229738 non-null  object
 11  POSTCODE                       229738 non-null  object
 12  FORMATTED_ADDRESS              229738 non-nu

# 3 Export the DataFrame to a CSV file

In [17]:
# Define output path
output_path = '../../../../data/Locomizer_edited/pois/2023-05_POIs_for_JCD_Finland_(micodes).csv'

In [18]:
# Export the DataFrame to a CSV file
df.to_csv(output_path, index=False, encoding='utf-8')
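
When the exported file is read back, pandas infers column types by default, which can turn codes into integers and drop leading zeros. The sketch below shows one way to keep every column as text; it writes an invented toy frame to a temporary directory rather than to `output_path`, and the choice of `dtype=str` with `keep_default_na=False` is an assumption about what later steps need.

```python
import os
import tempfile

import pandas as pd

# Toy frame; POSTCODE has leading zeros that numeric inference would drop
toy = pd.DataFrame({
    "NAME": ["LEPOLA I"],
    "MICODE": ["10320302"],
    "POSTCODE": ["00100"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "toy_pois.csv")
    toy.to_csv(tmp_path, index=False, encoding="utf-8")

    # dtype=str keeps codes as text; keep_default_na=False keeps empty fields as ''
    restored = pd.read_csv(tmp_path, dtype=str, keep_default_na=False, encoding="utf-8")

print(restored)
```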