# IA317: Large-scale machine learning
# Sketching

In this lab, you will learn to work with [Min Hash](https://en.wikipedia.org/wiki/MinHash), a simple and efficient sketching algorithm for finding approximate nearest neighbors in sparse binary data. Below you will find functions that build hash tables and find approximate $k$-nearest neighbors using Min Hash.

## Import

In [1]:
import numpy as np

In [2]:
from scipy import sparse

In [3]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

## Instructions

Please provide short answers to the questions at the bottom of the notebook. Most involve Python coding. Add as many cells as necessary (code and text). 

This lab is not graded, but you may upload it to [eCampus](https://ecampus.paris-saclay.fr/course/view.php?id=18426) if you wish. Before doing so, make sure to:
* Delete all useless cells (tests, etc.)
* Check that **your code is running and does not produce any errors**. You may restart the kernel and run all cells at the end of the lab to check this.
* Keep the outputs.

## Data

The lab is based on the [20newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.

In [4]:
dataset_train = fetch_20newsgroups(subset='train')

In [5]:
target_names = dataset_train.target_names

In [6]:
print(len(target_names))

20


In [7]:
print(target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [8]:
print(dataset_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [9]:
y_train = dataset_train.target

In [10]:
print(target_names[y_train[0]])

rec.autos


In [11]:
np.unique(y_train, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([480, 584, 591, 590, 578, 593, 585, 594, 598, 597, 600, 595, 591,
        594, 593, 599, 546, 564, 465, 377], dtype=int64))

In [12]:
dataset_test = fetch_20newsgroups(subset='test')

In [13]:
y_test = dataset_test.target

## Vectorization

Each document is turned into a bag-of-words vector, then binarized: an entry is 1 if the word occurs in the document and 0 otherwise.

In [14]:
tf_vectorizer = CountVectorizer(min_df=5, max_df=0.2, stop_words='english')
tf_vectorizer.fit(dataset_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.2, max_features=None, min_df=5,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [15]:
X_train = tf_vectorizer.transform(dataset_train.data)
X_test = tf_vectorizer.transform(dataset_test.data)

In [16]:
X_train.data = np.ones(len(X_train.data))
X_test.data = np.ones(len(X_test.data))
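As a side note, `CountVectorizer` also has a `binary=True` option that produces 0/1 entries directly. The sketch below checks on a made-up toy corpus (not part of the lab data) that both routes give the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only).
corpus = ["the puck hit the net", "the net was empty", "hockey hockey hockey"]

# binary=True stores 1 for any term present, whatever its count.
X_binary = CountVectorizer(binary=True).fit_transform(corpus)

# Same result as counting first, then overwriting the stored values with ones.
X_counts = CountVectorizer().fit_transform(corpus)
X_counts.data = np.ones(len(X_counts.data))

print(X_binary.toarray())
```

Overwriting `.data` keeps the sparsity pattern untouched, which is why it is safe here: every stored count is at least 1.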

In [17]:
X_train.shape

(11314, 25614)

In [18]:
X_test.shape

(7532, 25614)

## Min Hash

Here are some useful functions for Min Hash sketching.

In [19]:
def jaccard_similarity(x, X):
    '''Get Jaccard similarities between a target and a set of samples.
    
    Parameters
    ----------
    x : np.ndarray
        Vector of size d.
    X : np.ndarray or sparse csr matrix.
        Data, as array of shape (n,d).
        
    Returns
    -------
    sims : np.ndarray
        Jaccard similarities, as a vector of size n.
    '''
    
    card_inter = X.dot(x)
    card_union = X.dot(np.ones(X.shape[1])) + np.sum(x) - card_inter
    sims = card_inter/card_union
    
    return sims
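To see why the formula avoids any dense matrix, here is a small check on made-up data: the same sparse products as in `jaccard_similarity` are compared with a direct computation on Python sets. The names `X_toy` and `x_toy` are illustrative only.

```python
import numpy as np
from scipy import sparse

# Three toy samples over d = 5 binary features (made-up data).
X_toy = sparse.csr_matrix(np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
]))
x_toy = np.array([1, 1, 0, 0, 0])

# Same formula as jaccard_similarity: |A and B| / (|A| + |B| - |A and B|).
card_inter = X_toy.dot(x_toy)
card_union = X_toy.dot(np.ones(X_toy.shape[1])) + np.sum(x_toy) - card_inter
sims_toy = card_inter / card_union   # 2/3, 1/3 and 0 for the three rows

# Reference computation with Python sets.
target = set(np.flatnonzero(x_toy).tolist())
rows = [set(X_toy[i].indices.tolist()) for i in range(X_toy.shape[0])]
sims_sets = np.array([len(r & target) / len(r | target) for r in rows])

print(sims_toy, sims_sets)
```

Only sparse matrix-vector products are involved, so the cost scales with the number of nonzero entries. Note that a sample with no active feature compared with an empty target would give 0/0.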

In [20]:
def get_permutations(d, L = 100):
    '''Get permutations.
    
    Parameters
    ----------
    d : int
        Dimension (number of indices to shuffle).
    L : int
        Number of permutations.
        
    Returns
    -------
    permutations : np.ndarray
        Permutations as array of shape (L,d)
    '''
    permutations = []
    for l in range(L):
        index = np.arange(d)
        np.random.shuffle(index)    
        permutations.append(list(index))
    return np.array(permutations)
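A possible vectorized alternative, shown here only as a sketch: sorting independent uniform random keys row by row also yields independent uniform permutations. The seeded generator and the toy sizes are assumptions made for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for reproducibility)
d, L = 8, 3                      # toy sizes

# Each row of random keys is sorted; the sorting order is a uniform permutation.
permutations_toy = np.argsort(rng.random((L, d)), axis=1)
print(permutations_toy)
```

This costs a sort per row instead of a shuffle, so it is mostly a matter of convenience for small `d`.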

In [21]:
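def get_signature(X, permutations):
    '''Compute the MinHash of each sample.
    
    Parameters
    ----------
    X : sparse csr matrix.
        Data (binary features), shape (n, d).
    permutations : np.ndarray
        Permutations as array of shape (L,d)
        
    Returns
    -------
    signature : np.ndarray
        MinHash signature as array of shape (n, L).
        Samples without any active feature get the value -1.
    '''
    n = X.shape[0]
    L, d = permutations.shape
    
    # -1 marks samples without any active feature
    signature = -np.ones((n, L), dtype=int)
    
    for i in range(L):
        # rank[f] = position of feature f in the i-th permutation
        rank = np.empty(d, dtype=int)
        rank[permutations[i]] = np.arange(d)
        for j in range(n):
            # active features of sample j (assumes no explicitly stored zeros)
            features = X.indices[X.indptr[j]:X.indptr[j + 1]]
            if len(features) > 0:
                # MinHash = active feature that comes first in the permutation
                signature[j, i] = features[np.argmin(rank[features])]
            
    return signature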
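The reason this signature is useful: for a uniform random permutation, two sets get the same MinHash with probability equal to their Jaccard similarity, so the fraction of matching signature entries estimates that similarity. The simulation below is a minimal sketch on two made-up sets; sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 50, 2000

# Two toy sets of active features (made-up): 10 shared, 20 in the union.
set_a = np.arange(0, 15)
set_b = np.arange(5, 20)
jaccard = len(np.intersect1d(set_a, set_b)) / len(np.union1d(set_a, set_b))

matches = 0
for _ in range(L):
    perm = rng.permutation(d)
    # rank[f] = position of feature f in the permutation
    rank = np.empty(d, dtype=int)
    rank[perm] = np.arange(d)
    # MinHash = the active feature that comes first in the permutation
    hash_a = set_a[np.argmin(rank[set_a])]
    hash_b = set_b[np.argmin(rank[set_b])]
    matches += int(hash_a == hash_b)

estimate = matches / L
print(jaccard, estimate)
```

The exact similarity is 0.5 here, and the estimate fluctuates around it with a standard deviation of about $\sqrt{0.25/L}$.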

In [22]:
def get_hash_tables(signature):
    '''Build hash tables.
    
    Parameters
    ----------
    signature : np.ndarray
        Data signature as array of shape (n, L)
        
    Returns
    -------
    hash_tables : list of dict
        List of L hash tables
    '''    
    hash_tables = []
    for sig in signature.T:
        hash_tables.append({s: list(np.argwhere(sig == s).ravel()) for s in np.unique(sig)})
    return hash_tables
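Each hash table maps a MinHash value to the list of samples sharing it. As an illustration on a made-up signature, the same structure can be built in a single pass per column with `defaultdict`; this variant uses plain Python ints as keys, which is an assumption of the sketch rather than what `get_hash_tables` does.

```python
from collections import defaultdict
import numpy as np

# Toy signature for n = 4 samples and L = 2 permutations (made-up values).
signature_toy = np.array([
    [7, 2],
    [7, 5],
    [3, 2],
    [7, 2],
])

hash_tables_toy = []
for sig in signature_toy.T:
    table = defaultdict(list)
    # One pass over the column: sample i goes into the bucket of its hash value.
    for i, s in enumerate(sig):
        table[int(s)].append(i)
    hash_tables_toy.append(dict(table))

print(hash_tables_toy)
```

Samples 0, 1 and 3 share bucket 7 in the first table, while samples 0, 2 and 3 share bucket 2 in the second.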

In [23]:
def get_approximate_knn(signature_test, hash_tables, X_train, X_test, k = 3, factor = 10):
    '''Get approximate k-nearest neighbors (for Jaccard distance).
    
    Parameters
    ----------
    signature_test : np.ndarray
        Data signature as array of shape (n_test, L).
    hash_tables : list of dict
        List of L hash tables (based on train set).
    X_train : np.ndarray or sparse csr matrix
        Training data, shape (n_train, d).
    X_test : np.ndarray or sparse csr matrix
        Test data, shape (n_test, d)
    k : int
        Number of nearest neighbors.
    factor : int
        Multiplicative factor. 
        Nearest neighbors are searched in a list of factor * k samples.
        
    Returns
    -------
    nn : np.ndarray
        Approximate k-nearest neighbors, as arrays of shape (n_test, k).
        Each entry is in range(n_train)
    '''    
    nn_list = []
    for i, sig in enumerate(signature_test):
        # search potential nearest neighbors
        neighbors = []
        for j, key in enumerate(sig):
            if key in hash_tables[j]:
                neighbors += hash_tables[j][key]
        values, counts = np.unique(neighbors, return_counts = True) 
        # compute actual nearest neighbors
        if len(values) >= k:
            indices = values[np.argsort(-counts)][:factor * k]
            unit_vector = np.zeros(X_test.shape[0])
            unit_vector[i] = 1
            x_test = X_test.T.dot(unit_vector)
            nn_list.append(indices[np.argsort(-jaccard_similarity(x_test, X_train[indices]))[:k]])
        else:
            # complete with random values if necessary
            nn_list.append(np.array(list(values) + list(np.random.choice(X_train.shape[0], size = k - len(values)))))
    return np.array(nn_list)
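The candidate search in `get_approximate_knn` boils down to counting, for each training sample, in how many tables it falls in the same bucket as the query. A minimal sketch of that counting step, with made-up tables and a made-up query signature:

```python
from collections import Counter

# Toy hash tables (made-up), one per permutation: MinHash value -> train indices.
hash_tables_toy = [
    {7: [0, 1, 3], 3: [2]},
    {2: [0, 2, 3], 5: [1]},
    {4: [0, 3], 9: [1, 2]},
]
# Signature of one query sample: one MinHash value per permutation.
sig_query = [7, 2, 4]

# Count, for each train sample, the tables in which it collides with the query.
collisions = Counter()
for table, key in zip(hash_tables_toy, sig_query):
    collisions.update(table.get(key, []))

# Candidates sorted by decreasing number of collisions.
candidates = [index for index, _ in collisions.most_common()]
print(collisions, candidates)
```

The collision count divided by the number of tables is the MinHash estimate of the Jaccard similarity, which is why the function keeps the `factor * k` most frequent candidates before re-ranking them with the exact similarity.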

In [24]:
def knn_classifier(nn, y_train):
    '''Classification based on list of k-nearest neighbors.
    
    Parameters
    ----------
    nn : np.ndarray
        k-nearest neighbors, as arrays of shape (n_test, k).
        Each entry is in range(n_train)
    y_train : np.ndarray
        Target labels of the train set, array of shape (n_train,).
        
    Returns
    -------
    y_pred : np.ndarray
        Predicted labels of the test set, array of shape (n_test,).
    '''    
    y_pred = []
    for nn_ in nn:
        labels, counts = np.unique(y_train[nn_], return_counts=True)
        y_pred.append(labels[np.argmax(counts)])
    return np.array(y_pred)
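One detail of the majority vote is worth spelling out: `np.unique` returns labels in increasing order and `np.argmax` returns the first maximum, so ties are resolved in favor of the smallest label. A toy illustration with made-up labels and neighbors:

```python
import numpy as np

y_toy = np.array([0, 1, 1, 2, 2])   # labels of 5 train samples (made-up)
nn_toy = np.array([[1, 2, 0],        # neighbor labels 1, 1, 0: majority is 1
                   [0, 1, 3]])       # neighbor labels 0, 1, 2: three-way tie

y_pred_toy = []
for neighbors in nn_toy:
    labels, counts = np.unique(y_toy[neighbors], return_counts=True)
    # np.argmax returns the first maximum, so ties go to the smallest label.
    y_pred_toy.append(int(labels[np.argmax(counts)]))

print(y_pred_toy)
```

With 20 classes and only 3 neighbors, such ties can occur, and this rule biases them toward low class indices.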

## Questions

Unless otherwise specified, the considered metric is the [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index).

1. Complete the ``jaccard_similarity`` function. Make sure that your code does not produce any dense matrix.<br>
What is the nearest neighbor of the following sentence in the training set? What is the corresponding newsgroup?

In [25]:
sentence = "Ice hockey is a team sport played on ice, in which two teams of skaters use their sticks to shoot a puck into their opponent's net to score points."

In [26]:
print(sentence)

Ice hockey is a team sport played on ice, in which two teams of skaters use their sticks to shoot a puck into their opponent's net to score points.


In [27]:
# sparse matrix
X_sample = tf_vectorizer.transform([sentence])
X_sample.data = (X_sample.data > 0)

In [28]:
# dense vector
x_sample = np.array(X_sample.todense()).flatten()

In [29]:
jaccard_similarity(x_sample, X_train)

array([0.        , 0.        , 0.01515152, ..., 0.        , 0.01234568,
       0.        ])

2. Find the common words between the above sentence and its nearest neighbor.

In [30]:
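# get_feature_names() was removed in scikit-learn 1.2; with versions older than 1.0, use it instead
feat_names = tf_vectorizer.get_feature_names_out()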

In [31]:
nn = np.argmax(jaccard_similarity(x_sample, X_train))
print(dataset_train.data[nn])

From: chuck@mks.com (Chuck Lownie)
Subject: Re: Tie Breaker....(Isles and Devils)
Organization: Mortice Kern Systems Inc., Waterloo, Ontario, CANADA
Lines: 27

In article <lrw509f@rpi.edu> wangr@rpi.edu writes:
>	Are people here stupid or what??? It is a tie breaker, of cause they
>have to have the same record. How can people be sooooo stuppid to put win as
>first in the list for tie breaker??? If it is a tie breaker, how can there be
>different record???? Man, I thought people in this net are good with hockey.
>I might not be great in Math, but tell me how can two teams ahve the same points
>with different record??? Man...retard!!!!!! Can't believe people actually put
>win as first in a tie breaker......
>
>


I didn't see any smilies in this message so.......

                W     T    L    PTs
   Team A      50    30    4    104
   Team B      52    32    0    104


There you go.  Two teams that tie in points without identical records.


-- 







In [32]:
nearest = np.array(X_train[nn].todense().ravel())

In [33]:
word_index = np.argwhere((nearest * x_sample) == 1)[:,1]
for i in word_index:
    print(feat_names[i])

hockey
net
points
team
teams


3. Complete the function ``get_signature``.<br>
Get the 3 nearest neighbors of the above sentence using Min Hash with 100 permutations.<br>
Display the corresponding newsgroups.

In [34]:
d = X_train.shape[1]
permutations = get_permutations(d)

In [35]:
permutations.shape

(100, 25614)

In [None]:
signature_train = get_signature(X_train, permutations)
hash_tables = get_hash_tables(signature_train)

In [None]:
signature_sample = get_signature(X_sample, permutations)

In [None]:
nn = get_approximate_knn(signature_sample, hash_tables, X_train, X_sample, k = 3, factor = 10)
y_pred = knn_classifier(nn, y_train)

print('Newsgroups of the 3 nearest neighbors:', [target_names[y] for y in y_train[nn[0]]])
print('Predicted newsgroup:', target_names[y_pred[0]])

4. What is the accuracy of (approximate) 3-nn classification using Min Hash with 100 permutations?<br>
Compare with the exact 3-nn classification based on (a) the Hamming distance, (b) the cosine similarity after SVD.<br>
Comment on the results.

In [None]:
signature_test = get_signature(X_test, permutations)
nn = get_approximate_knn(signature_test, hash_tables, X_train, X_test, k = 3, factor = 10)

In [None]:
from sklearn.metrics import accuracy_score

y_pred = knn_classifier(nn, y_train)
accuracy_score(y_test,y_pred)
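For the exact baseline (a) of question 4, one possible route (a sketch, not the expected solution) is to note that for binary vectors the Hamming distance is $|A| + |B| - 2|A \cap B|$, which needs only the same sparse products as the Jaccard similarity. The helper `hamming_distances` and the toy data are hypothetical.

```python
import numpy as np
from scipy import sparse

def hamming_distances(x, X):
    '''Hamming distances between a binary vector x (size d) and the rows of a
    binary sparse matrix X (shape (n, d)), without building a dense matrix.'''
    card_inter = X.dot(x)
    card_rows = X.dot(np.ones(X.shape[1]))
    # |A xor B| = |A| + |B| - 2 |A and B|
    return card_rows + np.sum(x) - 2 * card_inter

# Toy check (made-up data).
X_toy = sparse.csr_matrix(np.array([[1, 1, 0, 0], [0, 1, 1, 1], [0, 0, 0, 0]]))
x_toy = np.array([1, 1, 0, 0])
dists = hamming_distances(x_toy, X_toy)

# Indices of the 2 nearest samples for the Hamming distance.
nearest = np.argsort(dists)[:2]
print(dists, nearest)
```

In the toy check the empty sample ranks second, ahead of a sample that shares a feature with the target: the Hamming distance rewards the absence of features, unlike the Jaccard distance.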