# This file contains master-level homework on index structures

Did you know that you can implement an arbitrary metric? To get full points, complete the following task:

1. Choose a small sample of the original data (e.g. 10K New York points).
2. Describe each point with a coordinate vector and the discrete set of words in its name. **Design the metric function** -- a function that accepts 2 objects and returns a number. This function should estimate the "distance" in a merged space of words and coordinates. No common words? Far. Common words and similar vectors? Close!
3. **Use this metric in an index data structure**. Maybe you will [extend nmslib](https://github.com/nmslib/nmslib/issues/478), or maybe you will prefer my [NSW implementation](https://github.com/IUCVLab/proximity-cut/blob/master/modules/nsw/nsw.py) ([usage](https://github.com/IUCVLab/proximity-cut/blob/master/tests/nsw-visualization.ipynb), [custom HVDM metric](https://github.com/IUCVLab/proximity-cut/blob/master/modules/tools/hvdm.py); to apply a custom metric, pass `dist=func` into the constructor). Or maybe you will find a data structure that supports this out of the box :)
4. Run some tests!
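
The metric described in step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the reference solution: the name `mixed_distance`, the weight `alpha`, and the scale `max_geo` are hypothetical, and the word part is taken to be the Jaccard distance between the two word sets.

```python
import math

def mixed_distance(a, b, alpha=0.5, max_geo=1.0):
    """Hypothetical metric over (coords, words) pairs.

    a, b    -- tuples of ((x, y), set_of_words)
    alpha   -- assumed weight of the word part, in [0, 1]
    max_geo -- assumed scale used to cap the coordinate part at 1
    """
    (ax, ay), a_words = a
    (bx, by), b_words = b

    # Jaccard distance: 1.0 when no words are shared, 0.0 for identical sets.
    union = a_words | b_words
    jaccard = 1.0 - len(a_words & b_words) / len(union) if union else 0.0

    # Euclidean distance in coordinate space, scaled and clipped to [0, 1].
    geo = min(math.hypot(ax - bx, ay - by) / max_geo, 1.0)

    return alpha * jaccard + (1.0 - alpha) * geo


cafe_a = ((-73.99, 40.73), {"corner", "coffee", "shop"})
cafe_b = ((-73.98, 40.74), {"coffee", "shop"})
bank = ((-73.98, 40.74), {"city", "bank"})

print(mixed_distance(cafe_a, cafe_b))  # shared words and close coordinates
print(mixed_distance(cafe_a, bank))    # same coordinates as cafe_b, no shared words
```

Both parts are kept in the range [0, 1] so that neither dominates the sum by accident; how to weight them is a design choice left to the task.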

In [17]:
# copy all necessary code from another file here

In [18]:
import gdown
url = 'https://drive.google.com/uc?id=1LUudtCADqSxRl18ZzCzyPPGfhuUo2ZZs'
output = 'poi_sample_2M.zip'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1LUudtCADqSxRl18ZzCzyPPGfhuUo2ZZs
To: /content/poi_sample_2M.zip
100%|██████████| 105M/105M [00:00<00:00, 126MB/s] 


'poi_sample_2M.zip'

In [19]:
import zipfile
z = zipfile.ZipFile('poi_sample_2M.zip')
z.extractall()

In [20]:
import pickle
import matplotlib.pyplot as plt
import numpy as np

def split_shards(file, folder='shard', capacity=20000):
    import pickle, os, math, gc
    if not os.path.exists(folder):
        os.mkdir(folder)
    with open(file, "rb") as f:
        dataset = pickle.load(f)
    nshards = len(dataset) // capacity
    if nshards * capacity < len(dataset):
        nshards += 1
    
    for i in range(nshards):
        with open(f"{folder}/{i}", 'wb') as f:
            part = dataset[i * capacity:(i+1)*capacity]
            pickle.dump(part, f)
    dataset = None
    gc.collect()            

    
def dataset_get(indices, folder='shard', capacity=20000) -> list:
    result = []
    groups = {}
    for i in indices:
        x = i // capacity
        if x not in groups:
            groups[x] = []
        groups[x].append(i)
    for x in groups:
        with open(f"{folder}/{int(x)}", "rb") as f:
            sha = pickle.load(f)
            for i in groups[x]:
                row = sha[int(i % capacity)]
                result.append(row)
    return result


# Returns a generator that goes through the first `items` elements,
# opening the shard files sequentially, one at a time.
def iterate_dataset(items, folder="shard", capacity=20000):
    nshards = (items + capacity - 1) // capacity
    for x in range(nshards):
        with open(f"{folder}/{x}", "rb") as f:
            sha = pickle.load(f)
        # the last requested shard may be needed only partially
        count = min(capacity, items - x * capacity)
        for row in sha[:count]:
            yield row
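
The sharding helpers above rely on one piece of arithmetic: a global row index `i` lives in shard `i // capacity` at offset `i % capacity`. A small sketch of that mapping, with the hypothetical helper names `locate` and `shard_count` (they are not part of the homework code):

```python
def locate(index, capacity=20000):
    """Map a global row index to (shard number, offset inside the shard)."""
    return divmod(index, capacity)

def shard_count(n_rows, capacity=20000):
    """Number of shard files needed for n_rows rows (ceiling division)."""
    return -(-n_rows // capacity)

print(locate(0))               # (0, 0)
print(locate(19999))           # (0, 19999)
print(locate(20000))           # (1, 0)
print(shard_count(2_000_000))  # 100
print(shard_count(2_000_001))  # 101
```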

In [21]:
split_shards("poi_sample01.pickle")

In [22]:
import requests

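# Returns a pair of [longitude, latitude] lists: the NE and SW corners of the town's bounding box.
# NOTE: despite the parameter name, the request goes to the Yandex geocoder, so a Yandex API key is required.
def get_town_range_coordinates(town: str, google_maps_api_key: str) -> tuple:
  api = "https://geocode-maps.yandex.ru/1.x/"
  params = {"format": "json", "apikey": google_maps_api_key, "geocode": town}
  response = requests.get(api, params=params)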
  temp = response.json()['response']["GeoObjectCollection"]['featureMember'][0]["GeoObject"]["boundedBy"]['Envelope']
  sw = [float(x.strip()) for x in temp['lowerCorner'].split()]
  ne = [float(x.strip()) for x in temp['upperCorner'].split()]
  return ne, sw

GEO_CACHE = {}
def get_town_range_coordinates_cached(town: str, maps_key: str) -> tuple:
    global GEO_CACHE
    if GEO_CACHE is None:
      GEO_CACHE = {}
    if town not in GEO_CACHE:      
      GEO_CACHE[town] = get_town_range_coordinates(town, maps_key)
    else: 
      print('Cache Hit!')
    return GEO_CACHE[town]

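Replace `None` below with your Yandex Geocoder API key. The variable name mentions Google Maps, but the geocoding function above calls the Yandex API.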

In [23]:
my_google_maps_api_key = None

In [24]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
!pip install spacy
!python -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 19.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [26]:
import spacy
nlp = spacy.load('en_core_web_md')

In [27]:
from nltk.tokenize import word_tokenize 
import numpy as np

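def clean_word(sentence):
  # Keep unique, lower-cased alphabetic tokens that are not stop words
  # and are known to the spaCy vocabulary.
  temp = set(token.lower().strip()
    for token in word_tokenize(sentence)
    if token.isalpha() and token.lower() not in stop_words and
    token.lower() in nlp.vocab.strings)
  # Represent the text as the mean of its word vectors; None if no token survives.
  if len(temp) > 0:
    return np.mean([nlp.vocab[token].vector for token in temp], axis=0)
  return None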

## 1. Sample some data

In [28]:
ne_bound, sw_bound = get_town_range_coordinates_cached('New york', my_google_maps_api_key)

In [29]:
datasets = []
for i,item in enumerate(iterate_dataset(2000000)):  
  loc = item[0]
  correct = sw_bound[0] <= loc[0] <= ne_bound[0] and sw_bound[1] <= loc[1] <= ne_bound[1] 
  if correct:
    temp = clean_word(item[1])
    if temp is not None:
      datasets.append((item[0],temp,i))
  if len(datasets) >= 10000:
    break

## 2. [M][10 points] Metric which accepts 2 dataset items and returns distance

In [30]:
from scipy.spatial.distance import cosine, euclidean

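# a and b are tuples whose first element is the coordinates and second the mean word vector
def my_dist(a, b) -> float:
    ap, bp = a[0], b[0]
    atxt, btxt = a[1], b[1]
    # cosine distance between text vectors plus Euclidean distance between coordinates (in degrees)
    return cosine(atxt, btxt) + euclidean(ap, bp)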
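
Note that `my_dist` adds two quantities on different scales: cosine distance lies in [0, 2], while the Euclidean part is measured in degrees of longitude and latitude. One possible refinement, shown only as a sketch, is to bring both parts into [0, 1] before mixing them. The names `haversine_km`, `cosine_distance`, `scaled_dist`, and the parameters `w_text` and `geo_scale_km` are assumptions introduced here, and the block uses only the standard library so that it runs on its own.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def scaled_dist(a, b, w_text=0.5, geo_scale_km=50.0):
    """Weighted mix of text and geographic distance, each scaled to [0, 1]."""
    geo = min(haversine_km(a[0], b[0]) / geo_scale_km, 1.0)
    text = cosine_distance(a[1], b[1]) / 2.0  # cosine distance lies in [0, 2]
    return w_text * text + (1.0 - w_text) * geo

# Same place, orthogonal toy "text vectors": only the text part contributes.
a = ([-74.0, 40.7], [1.0, 0.0])
b = ([-74.0, 40.7], [0.0, 1.0])
print(scaled_dist(a, b))  # 0.25
```

The value of `geo_scale_km` decides how far apart two points must be before the geographic part saturates; 50 km is an arbitrary choice for a city-sized sample.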

## 3. [M][30 points] Build an index

We use the NSW index implementation from https://github.com/IUCVLab/proximity-cut/blob/master/modules/nsw/nsw.py. Download the file and place it next to this notebook so that the following cells can load it.

In [31]:
# execfile is not a Python 3 builtin; exec(open(...).read()) is the portable equivalent
exec(open('nsw.py').read())

Module NSW launched as program.


In [32]:
from nsw import Node, NSWGraph

In [33]:
def get_index():
  G = NSWGraph([], my_dist)
  G.build_navigable_graph([(row, 0) for row in datasets])
  return G

In [34]:
index = get_index()

## 4. [M][10 points] Write and pass some tests

In [35]:
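def find(town, query, index):
  try:
    # NOTE: `town` is not used yet; the search is always centred on New York.
    ne_bound, sw_bound = get_town_range_coordinates('New york', my_google_maps_api_key)
    # the query point is the centre of the bounding box
    ce_bound = (ne_bound[0]+ sw_bound[0])/2,(ne_bound[1]+ sw_bound[1])/2
    det = index.multi_search((ce_bound, clean_word(query)))
    # map the index results back to the original dataset rows
    return dataset_get([datasets[i][2] for i in det])
  except Exception:
    return 'Item not found'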

In [36]:
find("Manhattan", "coffee", index)

[([-74.093457, 40.70621],
  'Corner Coffee Shop, Food & Beverages, Candy. US, Jersey City, 129 Sterling Ave'),
 ([-74.026449, 40.752176],
  'Hollywood Deli & Grocery, Food & Beverages, Groceries & Convenience Stores. US, Hoboken, 1212 Washington St'),
 ([-73.962669, 40.634539],
  'Plaza Fruit & Vegetable Inc, Food & Beverages, Fruits & Vegetables. US, Brooklyn, 4 Newkirk Plz'),
 ([-74.033354, 40.618066],
  'Hardes Wine & Liquor, Food & Beverages, Liquor & Beverages. US, Brooklyn, 9314 3rd Ave'),
 ([-74.032083, 40.739785],
  'Sunshine Grocery Store Inc, Food & Beverages, Groceries & Convenience Stores. US, Hoboken, 240 Garden St')]
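
Beyond inspecting results by eye, one possible test for step 4 is to compare the index answers with exact neighbours found by brute force and to report recall. The sketch below is self-contained and uses toy 2D points with `math.dist`; the names `brute_force_knn` and `recall_at_k` are hypothetical, and the `approximate` list is a stand-in for the ids your index would actually return.

```python
import heapq
import math
import random

def brute_force_knn(query, data, dist, k=5):
    """Exact k nearest neighbours: indices of the k items closest to query."""
    return heapq.nsmallest(k, range(len(data)), key=lambda i: dist(query, data[i]))

def recall_at_k(found, expected):
    """Share of the exact neighbours that the index also returned."""
    expected = set(expected)
    return len(expected & set(found)) / len(expected)

random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
query = (0.5, 0.5)

exact = brute_force_knn(query, points, math.dist, k=5)
# Stand-in for an index answer: four correct ids and one wrong id.
approximate = exact[:4] + [-1]
print(recall_at_k(approximate, exact))  # 0.8
```

With the real data, `points` would be the sampled items, `math.dist` would be replaced by the custom metric, and recall would be averaged over several queries.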