### Read Data

读取 Yelp Dataset 数据集

数据集中的 business.json 文件作为我们的数据库的内容\
其中经度 "longitude" 纬度 "latitude" 被作为是位置量\
其中键值 "categories" 作为算法中的 Keyword\
"business_id" 作为建立数据中的关键字的标识符


读取之后实现了一个数据库，数据库的结构是：\
| business_id | latitude | longitude | Keywords |
|-------------|----------|-----------|----------|
| b_001       | 34.0522  | -118.2437 | 餐饮, 健康 |
| b_002       | 40.7128  | -74.0060  | 零售, 数字化 |
| b_003       | 37.7749  | -122.4194 | 金融, 创新 |
| b_004       | 37.7749  | -122.4194 | 旅游, 文化 |
| b_005       | 37.7749  | -122.4194 | 娱乐, 教育 |

In [1]:
# 导入包
import json
import sqlite3


### Build Dataset
建立符合规模的数据库

In [1]:
import json
import random
import sqlite3
import time
from tqdm import tqdm

# 数据库路径
db_path = 'businesses.db'

# 创建或连接数据库
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# 创建表
cursor.execute('''
    CREATE TABLE IF NOT EXISTS business_table (
        business_id TEXT PRIMARY KEY,
        latitude REAL NOT NULL,
        longitude REAL NOT NULL,
        keywords TEXT NOT NULL
    )
''')

# JSON 文件路径
json_file_path = 'data/yelp_dataset/yelp_academic_dataset_business.json'

keyword_set = []

# 读取 JSON 文件并插入数据
with open(json_file_path, 'r', encoding='utf-8') as file:
    count = 0
    failed = 0
    for line in tqdm(file, total=150346, desc="Processing JSON"):
        count += 1
        try:
            data = json.loads(line)
            business_id = data['business_id']
            latitude = data['latitude']
            longitude = data['longitude']

            categories = data.get('categories', '').split(', ')
            if not categories:
                continue

            if len(keyword_set) >= 100:
                # 随机选择两个 keyword_set 中的关键字
                selected = random.sample(keyword_set, 2)
                selected_category = ', '.join(selected)
            else:
                # keyword_set 个数少于100， 继续从中选取
                if len(categories) >= 2:
                    selected = random.sample(categories, 2)
                    keyword_set.append(selected)
                else:
                    keyword_set.extend(selected) if selected else None
                selected_category = ', '.join(selected)


            # 插入数据
            cursor.execute(
                "INSERT OR IGNORE INTO business_table (business_id, latitude, longitude, keywords) VALUES (?, ?, ?, ?)",
                (business_id, latitude, longitude, selected_category)
            )
        except Exception as e:
            print(f"Error processing line {count}: {e}")
            failed += 1

print(f"Processed {count} lines, {count - failed} successful, {failed} failed")

# 提交事务
conn.commit()

# 查询数据
cursor.execute("SELECT COUNT(*) FROM business_table")
row_count = cursor.fetchone()[0]
print(f"Total records in business_table: {row_count}")

# 关闭连接
conn.close()

Processing JSON:  14%|█▎        | 20370/150346 [00:00<00:01, 101946.96it/s]

Error processing line 101: sequence item 0: expected str instance, list found
Error processing line 102: sequence item 0: expected str instance, list found
Error processing line 103: sequence item 0: expected str instance, list found
Error processing line 104: sequence item 0: expected str instance, list found
Error processing line 105: sequence item 0: expected str instance, list found
Error processing line 106: sequence item 0: expected str instance, list found
Error processing line 107: sequence item 0: expected str instance, list found
Error processing line 108: sequence item 0: expected str instance, list found
Error processing line 109: sequence item 0: expected str instance, list found
Error processing line 110: sequence item 0: expected str instance, list found
Error processing line 111: sequence item 0: expected str instance, list found
Error processing line 112: sequence item 0: expected str instance, list found
Error processing line 113: sequence item 0: expected str instanc

Processing JSON:  27%|██▋       | 40457/150346 [00:00<00:01, 97950.73it/s] IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)

Processing JSON:  33%|███▎      | 50266/150346 [00:00<00:01, 96788.01it/s]IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)

Processing JSON:  54%|█████▎    | 80708/150346 [00:00<00:00, 100194.19it/s]IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To chan

Error processing line 140978: sequence item 0: expected str instance, list found
Error processing line 140979: sequence item 0: expected str instance, list found
Error processing line 140980: sequence item 0: expected str instance, list found
Error processing line 140981: sequence item 0: expected str instance, list found
Error processing line 140982: sequence item 0: expected str instance, list found
Error processing line 140983: sequence item 0: expected str instance, list found
Error processing line 140984: sequence item 0: expected str instance, list found
Error processing line 140985: sequence item 0: expected str instance, list found
Error processing line 140986: sequence item 0: expected str instance, list found
Error processing line 140987: sequence item 0: expected str instance, list found
Error processing line 140988: sequence item 0: expected str instance, list found
Error processing line 140989: sequence item 0: expected str instance, list found
Error processing line 140990

In [3]:
import json

filename = "data/yelp_dataset/yelp_academic_dataset_business.json"
count = 0

with open(filename, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            # 解析每行 JSON
            data = json.loads(line)
            # 检查是否存在 business_id
            if "business_id" in data:
                count += 1
        except json.JSONDecodeError:
            # 跳过解析失败的行
            continue

print(f"文件中有 {count} 个 business_id。")

文件中有 150346 个 business_id。


In [2]:
def hilbert_to_64bit_binary(hilbert_value):
    """
    将希尔伯特数值转换为64位二进制数。

    参数：
    - hilbert_value: 希尔伯特曲线数值

    返回：
    - 64位二进制字符串
    """
    binary_str = bin(hilbert_value)[2:]  # 去掉前缀 '0b'
    # 确保为64位，不足补0，超出则截断
    binary_64bit = binary_str[-64:].zfill(64)
    return binary_64bit

# 示例希尔伯特数值
hilbert_value = 5  # 假设的希尔伯特数值

# 转换为64位二进制数
binary_result = hilbert_to_64bit_binary(hilbert_value)

print(f"64位二进制数: {binary_result}")

64位二进制数: 0000000000000000000000000000000000000000000000000000000000000101


In [31]:
import numpy as np
from hilbert import encode, decode

# 定义地理坐标
latitude = 51.551126
longitude = -98.335695

# 定义希尔伯特曲线的参数
n_dimensions = 2  # 二维空间（纬度和经度）
n_bits = 10       # 选择合适的位数，决定了分辨率

# 1. 归一化到 [0, 1]
normalized_latitude = (latitude + 90) / 180
normalized_longitude = (longitude + 180) / 360

# 2. 缩放到 [0, 2^n_bits - 1]
max_value = 2**n_bits - 1
scaled_latitude = int(normalized_latitude * max_value)
scaled_longitude = int(normalized_longitude * max_value)

# 3. 转换为 NumPy 数组
points = np.array([scaled_latitude, scaled_longitude])

# 4. 编码为希尔伯特整数
hilbert_integer = encode(points, n_dimensions, n_bits)
print("希尔伯特整数:", hilbert_integer)

# 5. 解码回地理坐标（可选）
decoded_points = decode(hilbert_integer, n_dimensions, n_bits)
print(decoded_points)
print(decoded_points.dtype)
print(decoded_points.shape)
print(decoded_points[0][0])
print(decoded_points[0][1])
decoded_latitude = (decoded_points[0][0] / max_value) * 180 - 90
decoded_longitude = (decoded_points[0][1] / max_value) * 360 - 180
print("解码后的地理坐标:", (decoded_latitude, decoded_longitude))

希尔伯特整数: 1005786
[[804 232]]
uint64
(1, 2)
804
232
解码后的地理坐标: (51.46627565982405, -98.35777126099707)
