# COMP 484 - Practical Assignment 3

#### Ramraj Chimouriya
#### CE IV/I

## Chapter 6 - Clustering – Finding Related Posts

___

### Converting raw test into a bag of words

Scikit's CounterVectorizer method count words and represent those counts as a vector.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

Consider the following examples:

In [5]:
content = ["How to format my hard disk",
           "Hard disk format problems"]

X = vectorizer.fit_transform(content)
vectorizer.get_feature_names_out()

array(['disk', 'format', 'hard', 'how', 'my', 'problems', 'to'],
      dtype=object)

In [6]:
print(X.toarray().transpose())

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


### Counting words

Reading posts from data/toy directory and feeding to CountVectorizer.

In [13]:
from pathlib import Path
TOY_DIR = Path("data/toy")
posts = []
for fn in TOY_DIR.iterdir():
    with open(fn, 'r') as f:
        posts.append(f.read())

In [14]:
vectorizer = CountVectorizer(min_df=1)

In [15]:
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print(f"samples: {num_samples} \t features:{num_features}")

samples: 5 	 features:25


In [17]:
print(vectorizer.get_feature_names_out())

['about' 'actually' 'capabilities' 'contains' 'data' 'databases' 'images'
 'imaging' 'interesting' 'is' 'it' 'learning' 'machine' 'most' 'much'
 'not' 'permanently' 'post' 'provide' 'save' 'storage' 'store' 'stuff'
 'this' 'toy']


Now, we can vectorize our new post:

In [18]:
new_post = "Imaging databases"
new_post_vec = vectorizer.transform([new_post])

In [20]:
print(new_post_vec)

  (0, 5)	1
  (0, 7)	1


In [23]:
print(new_post_vec.toarray())

[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


For the similarity measurement (the naive one), we calculate the Euclidean distance between the count vectors of the new post and all the old posts.

In [24]:
import scipy
def dist_raw(v1, v2):
    delta = v1-v2
    return scipy.linalg.norm(delta.toarray())

Function that takes the current dataset and the new post in vectorized form as well as a distance function and prints out an analysis of how well the distance function works:

In [43]:
def best_post(X, new_vec, dist_func):
    best_doc = None
    best_dist = float('inf')    #infinite value as a starting point
    best_i = None
    for i, post in enumerate(posts):
        if post == new_post:
            continue
        post_vec = X.getrow(i)
        d = dist_func(post_vec, new_vec)
        print(f"=== Post {i} with dist={round(d,2)}: \n'{post}'")
        if d < best_dist:
            best_dist = d
            best_i = i
    print(f"===> Best post is {best_i} with dist={round(best_dist,2)}")

In [44]:
best_post(X_train, new_post_vec, dist_raw)

=== Post 0 with dist=1.41: 
'Imaging databases store data.'
=== Post 1 with dist=2.0: 
'Most imaging databases save images permanently.
'
=== Post 2 with dist=1.73: 
'Imaging databases provide storage capabilities.'
=== Post 3 with dist=4.0: 
'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
=== Post 4 with dist=5.1: 
'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
===> Best post is 0 with dist=1.41


### Normalizing word count vectors`

In [45]:
def dist_norm(v1, v2):
    v1_normalized = v1 / scipy.linalg.norm(v1.toarray())
    v2_normalized = v2 / scipy.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return scipy.linalg.norm(delta.toarray())

In [46]:
best_post(X_train, new_post_vec, dist_norm)

=== Post 0 with dist=0.77: 
'Imaging databases store data.'
=== Post 1 with dist=0.92: 
'Most imaging databases save images permanently.
'
=== Post 2 with dist=0.86: 
'Imaging databases provide storage capabilities.'
=== Post 3 with dist=1.41: 
'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
=== Post 4 with dist=0.77: 
'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
===> Best post is 0 with dist=0.77


### Removing less important words