## Supervised Machine Learning – Practice Notebook

- This notebook demonstrates core supervised learning concepts using Python.
Topics covered include:
    - Data preprocessing
    - Train-test split
    - Model training
    - Prediction
    - Model evaluation

- This notebook is intended for learning and hands-on practice.


### Similarity Measures in Machine Learning
- Similarity (or distance) measures help quantify how close or far two data points are.  
They are widely used in:
    - Clustering  
    - Recommendation systems  
    - Nearest-neighbor search  
    - Text similarity  
    - Anomaly detection  

- This notebook demonstrates various similarity/distance metrics using small text examples.

---

### Converting Text into Numerical Vectors

- Many similarity measures require numerical vectors as input.  
- We will convert two simple text sentences into bag-of-words vectors.
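To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The helper name `bag_of_words` is hypothetical, and splitting on whitespace is a simplifying assumption; `CountVectorizer` applies its own tokenization rules, so results can differ on text with punctuation or very short tokens.

```python
# Hypothetical helper (not part of scikit-learn): counts each vocabulary
# term per document, with the vocabulary sorted alphabetically.
def bag_of_words(docs):
    # Assumption: lowercase + whitespace split is enough for these toy sentences
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = [[toks.count(term) for term in vocab] for toks in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(['dogs like running', 'cats like napping'])
print(vocab)    # ['cats', 'dogs', 'like', 'napping', 'running']
print(vectors)  # [[0, 1, 1, 0, 1], [1, 0, 1, 1, 0]]
```

Each column corresponds to one vocabulary word and each row to one sentence, which is the same layout the cells below produce with `CountVectorizer`.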

In [1]:
# Two short example documents
doc= ['dogs like running', 'cats like napping']
doc

['dogs like running', 'cats like napping']

In [10]:
# Convert these documents into vectors
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
cvect = CountVectorizer()
cvect

In [4]:
vec = cvect.fit_transform(doc)
vec

<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [5]:
vec.toarray()

array([[0, 1, 1, 0, 1],
       [1, 0, 1, 1, 0]], dtype=int64)

In [6]:
import pandas as pd

In [7]:
df = pd.DataFrame(vec.toarray(), columns = cvect.get_feature_names_out())
df

   cats  dogs  like  napping  running
0     0     1     1        0        1
1     1     0     1        1        0


### Euclidean distance

- Euclidean distance represents the straight-line distance between two points in space.
- It is one of the most commonly used distances in machine learning.
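As a cross-check on the library calls, the straight-line distance can be computed by hand with NumPy. This is a small sketch; the arrays `a` and `b` are the two bag-of-words rows from earlier, typed in directly so the snippet runs on its own.

```python
import numpy as np

# The two bag-of-words vectors from the earlier cells, hard-coded here
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 1, 0])

# Square the coordinate differences, sum them, take the square root
euclid = np.sqrt(np.sum((a - b) ** 2))
print(euclid)  # 2.0
```

Four coordinates differ by 1, so the sum of squares is 4 and the distance is 2.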

In [14]:
from sklearn.metrics import pairwise_distances 
from sklearn.metrics.pairwise import euclidean_distances

In [15]:
import numpy as np

In [16]:
arr = np.identity(3)
arr

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [17]:
pairwise_distances(arr)

array([[0.        , 1.41421356, 1.41421356],
       [1.41421356, 0.        , 1.41421356],
       [1.41421356, 1.41421356, 0.        ]])

In [18]:
euclidean_distances(arr)

array([[0.        , 1.41421356, 1.41421356],
       [1.41421356, 0.        , 1.41421356],
       [1.41421356, 1.41421356, 0.        ]])

In [19]:
x = arr[0, :]
x

array([1., 0., 0.])

In [20]:
y = arr[1:, :]
y

array([[0., 1., 0.],
       [0., 0., 1.]])

In [21]:
pairwise_distances(x.reshape(1, -1), y)

array([[1.41421356, 1.41421356]])

In [22]:
vec.toarray()

array([[0, 1, 1, 0, 1],
       [1, 0, 1, 1, 0]], dtype=int64)

In [23]:
pairwise_distances(vec.toarray())

array([[0., 2.],
       [2., 0.]])

In [24]:
euclidean_distances(vec.toarray())

array([[0., 2.],
       [2., 0.]])

In [25]:
pairwise_distances(X = vec.toarray()[0].reshape(1, -1), Y = vec.toarray()[1].reshape(1, -1))

array([[2.]])

### Manhattan Distance
- Manhattan distance adds the absolute differences between coordinates.
- Useful when movement is grid-like or when features are sparse.
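The same two bag-of-words vectors can be used to check the definition by hand. A minimal sketch, with `a` and `b` hard-coded so it runs independently:

```python
import numpy as np

# The two bag-of-words vectors from the earlier cells, hard-coded here
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 1, 0])

# Sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))
print(manhattan)  # 4
```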

In [26]:
# Manhattan Distance
from sklearn.metrics.pairwise import manhattan_distances

In [28]:
manhattan_distances(vec.toarray())

array([[0., 4.],
       [4., 0.]])

In [29]:
pairwise_distances(vec.toarray(),metric='manhattan')

array([[0., 4.],
       [4., 0.]])

### Minkowski Distance
- Minkowski distance generalizes Euclidean and Manhattan distances via parameter p:
    - p = 1 → Manhattan distance
    - p = 2 → Euclidean distance
    - p = 3 → Higher-order distance
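The role of `p` is easier to see when the formula is written out. The function below is an illustrative sketch (the name `minkowski` is chosen here, not taken from a library) applied to the two bag-of-words vectors.

```python
import numpy as np

def minkowski(u, v, p):
    """Illustrative Minkowski distance: (sum |u_i - v_i|^p)^(1/p)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# The two bag-of-words vectors from the earlier cells, hard-coded here
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 1, 0])

d1 = minkowski(a, b, 1)  # Manhattan: 4.0
d2 = minkowski(a, b, 2)  # Euclidean: 2.0
d3 = minkowski(a, b, 3)  # cube root of 4, about 1.5874
print(d1, d2, d3)
```

Because every non-zero difference here has absolute value 1, the inner sum is always 4, and only the outer root changes with `p`.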

In [30]:
pairwise_distances(vec.toarray(), metric = 'minkowski', p = 1)

array([[0., 4.],
       [4., 0.]])

In [32]:
pairwise_distances(vec.toarray(), metric = 'minkowski', p = 2)

array([[0., 2.],
       [2., 0.]])

In [33]:
pairwise_distances(vec.toarray(), metric = 'minkowski', p = 3)

array([[0.        , 1.58740105],
       [1.58740105, 0.        ]])

### Cosine similarity
- Cosine similarity measures the cosine of the angle between two vectors.
- Useful for text similarity because it ignores magnitude and focuses on direction.
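Written out, cosine similarity is the dot product divided by the product of the vector lengths. A small NumPy sketch on the two bag-of-words vectors, hard-coded so it runs on its own:

```python
import numpy as np

# The two bag-of-words vectors from the earlier cells, hard-coded here
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 1, 0])

# dot(a, b) counts the shared word ("like"); each norm is sqrt(3)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)
```

The sentences share one word out of three each, so the value is 1 / (sqrt(3) * sqrt(3)), which is one third.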

In [57]:

from sklearn.metrics.pairwise import cosine_similarity

In [58]:
cosine_similarity(vec.toarray())

array([[1.        , 0.33333333],
       [0.33333333, 1.        ]])

### Cosine distance
- Cosine distance = 1 − cosine similarity.

In [36]:
# Cosine distance
from sklearn.metrics.pairwise import cosine_distances

In [37]:
cosine_distances(vec.toarray())

array([[0.        , 0.66666667],
       [0.66666667, 0.        ]])

### Angular Distance (Computed from Cosine Similarity)
- Angular distance = arccos(cosine similarity), expressed in radians or degrees.
- It provides angle-based separation between vectors.

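In [59]:
# Clip to [-1, 1] first: floating-point rounding can push a vector's
# self-similarity slightly above 1, which makes arccos return nan
np.arccos(np.clip(cosine_similarity(vec.toarray()), -1.0, 1.0))

array([[0.        , 1.23095942],
       [1.23095942, 0.        ]])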

In [39]:
# Same computation in degrees, clipped to avoid nan on the diagonal
np.degrees(np.arccos(np.clip(cosine_similarity(vec.toarray()), -1.0, 1.0)))

array([[ 0.        , 70.52877937],
       [70.52877937,  0.        ]])

### Hamming Distance
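- Hamming distance counts the positions where two vectors differ.
- The `hamming` metric used below reports this as a fraction of the vector length: here 4 of 5 positions differ, giving 0.8.
- Common in error detection and binary string comparison.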
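A short sketch showing both the raw count of differing positions and the normalized fraction, using the two bag-of-words vectors hard-coded for self-containment:

```python
import numpy as np

# The two bag-of-words vectors from the earlier cells, hard-coded here
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 1, 0])

mismatches = int(np.sum(a != b))  # number of positions that differ: 4
fraction = mismatches / len(a)    # share of positions that differ: 0.8
print(mismatches, fraction)
```

Only the "like" column matches, so 4 of the 5 positions differ.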

In [40]:
# Hamming distance
pairwise_distances(vec.toarray(), metric = 'hamming')

array([[0. , 0.8],
       [0.8, 0. ]])

### Jaccard Score
- Jaccard similarity = intersection / union of binary features.
- Useful for sets or sparse binary vectors.
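The intersection-over-union idea can also be shown directly on word sets. In this sketch the helper name `jaccard` is hypothetical, tokens come from a plain whitespace split, and both sentences are assumed to be non-empty.

```python
def jaccard(doc_a, doc_b):
    """Illustrative set-based Jaccard similarity of two sentences."""
    set_a, set_b = set(doc_a.split()), set(doc_b.split())
    # Assumption: at least one sentence is non-empty, so the union is non-empty
    return len(set_a & set_b) / len(set_a | set_b)

score = jaccard('dogs like running', 'cats like napping')
print(score)  # 0.2
```

One shared word ("like") out of five distinct words gives 1/5.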

In [41]:
# Jaccard Score
from sklearn.metrics import jaccard_score

In [42]:
jaccard_score(vec.toarray()[0], vec.toarray()[1])

0.2

### Levenshtein Distance (Edit Distance)
- Levenshtein distance counts the minimum number of edits (insertions, deletions, substitutions) needed to transform one word into another.
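For intuition about what the count means, here is a sketch of the classic dynamic-programming approach in pure Python. It is an illustration only and not necessarily how the `Levenshtein` package computes the result; the function name `levenshtein` is chosen here.

```python
def levenshtein(s, t):
    """Illustrative edit distance using two rolling rows of the DP table."""
    # prev[j] = distance between the first i-1 chars of s and the first j chars of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning the first i chars of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or keep if the chars match
            ))
        prev = curr
    return prev[-1]

print(levenshtein('computing', 'computer'))    # 3
print(levenshtein('computation', 'computer'))  # 5
```

For 'computing' and 'computer', the shared prefix 'comput' costs nothing, and turning 'ing' into 'er' takes two substitutions plus one deletion.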

In [43]:
# Levenshtein distance
!pip install Levenshtein




In [44]:
import Levenshtein

In [45]:
Levenshtein.distance('between', 'betweens')

1

In [46]:
Levenshtein.distance('compute', 'computer')


1

In [47]:
Levenshtein.distance('computing', 'computer')

3

In [48]:
Levenshtein.distance('computation', 'computer')

5

## Summary
- This notebook covered major similarity and distance metrics used in machine learning:
    - Euclidean
    - Manhattan
    - Minkowski
    - Cosine similarity & distance
    - Angular distance
    - Hamming
    - Jaccard
    - Levenshtein

These measures are fundamental in clustering, NLP, recommendation engines, and nearest-neighbor algorithms.