## Distance Metrics

- Fundamental Concept: Distance metrics quantify the similarity or dissimilarity between objects in a dataset. They provide a numerical measure of how close or far apart data points are in a multidimensional space.

#### Importing necessary Libraries

- NumPy (np): Fundamental package for numerical computing with support for arrays and mathematical functions.
- Pandas (pd): Data manipulation and analysis library providing DataFrame and Series structures for structured data handling.
- Matplotlib (plt): Comprehensive plotting library for creating 2D graphics with fine-grained control.
- Seaborn (sns): Statistical data visualization library offering high-level interfaces for creating attractive and informative plots.
- scikit-learn (sklearn): Machine learning library featuring various algorithms for classification, regression, clustering, and more, along with utilities for data preprocessing and model evaluation.

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances, pairwise_distances # for calculating the distances 
from sklearn.metrics import jaccard_score # for calculating the jaccard_score

#### 1. Write a Python program to demonstrate the calculation of the following distance measures for various variables and any sample dataset.
- i. Euclidean Distance
- ii. Manhattan Distance
- iii. Minkowski distance

#### Euclidean distance
- Euclidean distance is a measure of the straight-line distance between two points in Euclidean space.
- It's calculated as the square root of the sum of the squared differences between corresponding elements of two vectors.
- Commonly used in clustering algorithms like K-means, hierarchical clustering, and in dimensionality reduction techniques like PCA (Principal Component Analysis).

#### Manhattan distance
- Manhattan distance, also known as city block distance or taxicab distance, calculates the distance between two points in a grid based on the sum of the absolute differences of their coordinates.
- It measures the distance you'd have to travel along the grid to reach one point from the other, moving horizontally and vertically.
- It's often used in classification algorithms like K-nearest neighbors (KNN).
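
For reference, the three measures in this exercise can be written as below for two points x and y with n features each. These are the standard textbook definitions; the symbols x, y, n and p are notation introduced here and do not come from the notebook itself.

```latex
\begin{aligned}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert \\
d_{\text{Minkowski}}(x, y) &= \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}, \qquad p \ge 1
\end{aligned}
```

Setting p=1 in the Minkowski formula gives the Manhattan distance, and p=2 gives the Euclidean distance.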

In [102]:
# Loading Iris dataset from seaborn
df=sns.load_dataset('iris')
df

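|     | sepal_length | sepal_width | petal_length | petal_width | species   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica |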


#### Iris dataset
- Contents: It consists of 150 samples of iris flowers, each belonging to one of three species: Setosa, Versicolor, and Virginica. There are 50 samples for each species.

- Features: For each flower sample, four features are measured:

    - Sepal length (in cm)
    - Sepal width (in cm)
    - Petal length (in cm)
    - Petal width (in cm)
- Target Variable: The target variable is the species of iris flower, which is categorical and consists of three classes: Setosa, Versicolor, and Virginica.

In [103]:
# Extracting the first four columns of the DataFrame 'df' to create a feature matrix 'X'.
X=df.iloc[:,:4]

In [104]:
X

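|     | sepal_length | sepal_width | petal_length | petal_width |
|-----|--------------|-------------|--------------|-------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         |
| ... | ...          | ...         | ...          | ...         |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         |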


In [105]:
# Taking samples for calculating distances manually
x1=X.iloc[0]
x2=X.iloc[20]

In [106]:
x1

sepal_length    5.1
sepal_width     3.5
petal_length    1.4
petal_width     0.2
Name: 0, dtype: float64

#### Manual calculation of distances

In [11]:
# Euclidean distance

In [107]:
euclidean_dist = np.sqrt(np.sum((x1 - x2) ** 2))
print("The Euclidean distance between point 1 and point 2 is: {:.2f}".format(euclidean_dist))

The Euclidean distance between point 1 and point 2 is: 0.44


In [108]:
# Manhattan Distance

In [109]:
manhattan_dist = np.sum(np.abs(x1 - x2))
print("The Manhattan distance between point 1 and point 2 is: {:.2f}".format(manhattan_dist))

The Manhattan distance between point 1 and point 2 is: 0.70


In [110]:
# Minkowski distance

In [112]:
p = int(input("Enter the parameter that defines the distance metric: ")) # the order p of the Minkowski distance
minkowski_dist = np.power(np.sum(np.abs(x1 - x2) ** p), 1/p)
print("The Minkowski distance between point 1 and point 2 is: {:.2f}".format(minkowski_dist))

Enter the parameter that defines the distance metric: 3
The Minkowski distance between point 1 and point 2 is: 0.38
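
The three manual calculations above are special cases of one formula, which the following minimal sketch makes explicit. The helper name `minkowski_distance` is hypothetical, and the row-20 values are an assumption taken from the standard Iris data (the notebook only displays row 0); they do reproduce the 0.44, 0.70 and 0.38 printed above.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Row 0 is displayed in the notebook; the row-20 values are assumed
# from the standard Iris data and are not shown in the notebook output.
x1 = [5.1, 3.5, 1.4, 0.2]
x2 = [5.4, 3.4, 1.7, 0.2]

manhattan = minkowski_distance(x1, x2, p=1)    # p=1 is the Manhattan distance
euclidean = minkowski_distance(x1, x2, p=2)    # p=2 is the Euclidean distance
minkowski_3 = minkowski_distance(x1, x2, p=3)  # the p=3 case entered above

print(f"p=1: {manhattan:.2f}  p=2: {euclidean:.2f}  p=3: {minkowski_3:.2f}")
```

Writing the calculation once with p as a parameter avoids keeping three nearly identical expressions in sync.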


#### Using scikit learn

In [113]:
# Euclidean distance

In [114]:
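# Use a separate name so the scalar euclidean_dist computed manually above is not overwritten
euclidean_dist_sklearn = pairwise_distances(X, metric='euclidean')
print("Euclidean Distance Matrix (scikit-learn):\n", euclidean_dist_sklearn)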

Euclidean Distance Matrix (scikit-learn):
 [[0.         0.53851648 0.50990195 ... 4.45982062 4.65080638 4.14004831]
 [0.53851648 0.         0.3        ... 4.49888875 4.71805044 4.15331193]
 [0.50990195 0.3        0.         ... 4.66154481 4.84871117 4.29883705]
 ...
 [4.45982062 4.49888875 4.66154481 ... 0.         0.6164414  0.64031242]
 [4.65080638 4.71805044 4.84871117 ... 0.6164414  0.         0.76811457]
 [4.14004831 4.15331193 4.29883705 ... 0.64031242 0.76811457 0.        ]]


In [115]:
# Manhattan distance

In [116]:
manhattan_dist_sklearn = pairwise_distances(X, metric='manhattan')
print("\nManhattan Distance Matrix (scikit-learn):\n", manhattan_dist_sklearn)


Manhattan Distance Matrix (scikit-learn):
 [[0.  0.7 0.8 ... 7.5 7.3 6.6]
 [0.7 0.  0.5 ... 7.2 7.8 6.3]
 [0.8 0.5 0.  ... 7.7 7.9 6.8]
 ...
 [7.5 7.2 7.7 ... 0.  1.2 0.9]
 [7.3 7.8 7.9 ... 1.2 0.  1.5]
 [6.6 6.3 6.8 ... 0.9 1.5 0. ]]


In [117]:
# Minkowski distance

In [119]:
p = int(input("Enter the parameter that defines the distance metric: ")) # the order p of the Minkowski distance
minkowski_dist_sklearn = pairwise_distances(X, metric='minkowski', p=p)
# Report the value of p that was actually entered instead of a hard-coded 3
print(f"\nMinkowski Distance Matrix (p={p}, scikit-learn):\n", minkowski_dist_sklearn)

Enter the parameter that defines the distance metric: 3

Minkowski Distance Matrix (p=3, scikit-learn):
 [[0.         0.51044687 0.45143574 ... 3.99108431 4.20952111 3.81182833]
 [0.51044687 0.         0.25712816 ... 4.0165977  4.22692453 3.82013778]
 [0.45143574 0.25712816 0.         ... 4.14064278 4.33678555 3.93011963]
 ...
 [3.99108431 4.0165977  4.14064278 ... 0.         0.50132979 0.6082202 ]
 [4.20952111 4.22692453 4.33678555 ... 0.50132979 0.         0.62402515]
 [3.81182833 3.82013778 3.93011963 ... 0.6082202  0.62402515 0.        ]]
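
To make the structure of these matrices concrete, here is a rough NumPy sketch of what a pairwise distance matrix contains, using only the first five Iris rows displayed earlier. The function `pairwise_minkowski` is a hypothetical helper written for illustration; it is not how scikit-learn implements `pairwise_distances` internally.

```python
import numpy as np

# First five Iris rows as displayed earlier in this section.
X_small = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

def pairwise_minkowski(X, p=2):
    """Return the (n, n) matrix of Minkowski distances between rows of X."""
    # Broadcasting: (n, 1, d) - (1, n, d) gives an (n, n, d) array of differences.
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return np.sum(diff ** p, axis=-1) ** (1.0 / p)

D_euclidean = pairwise_minkowski(X_small, p=2)
D_manhattan = pairwise_minkowski(X_small, p=1)

# Entry [i, j] is the distance between sample i and sample j.
print(np.round(D_euclidean, 4))
print("Euclidean distance between samples 0 and 1:", round(D_euclidean[0, 1], 8))
print("Manhattan distance between samples 0 and 2:", round(D_manhattan[0, 2], 2))
```

The matrix is symmetric with zeros on the diagonal, and its top-left entries agree with the scikit-learn output above (for example 0.53851648 for samples 0 and 1).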


#### Importance of distance metrics

- In machine learning, distance metrics are fundamental for clustering algorithms like K-means, hierarchical clustering, and DBSCAN, where they determine how data points are grouped together based on their proximity. Similarly, in classification algorithms like K-nearest neighbors (KNN), distance metrics are used to identify the nearest neighbors of a data point and make predictions.
- Distance metrics are used in feature selection and dimensionality reduction techniques such as principal component analysis (PCA) and multidimensional scaling (MDS). They help in identifying important features or reducing the dimensionality of the dataset while preserving its underlying structure.
- Distance measures also appear in optimization and search algorithms such as gradient descent and simulated annealing, for example to measure how far apart successive candidate solutions are when deciding whether the search has converged.
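
As an illustration of the KNN use case mentioned above, the following is a minimal sketch of nearest-neighbour classification with a configurable Minkowski order. The training rows are copied from the Iris table shown earlier; the function `knn_predict` and the two query points are hypothetical and exist only for this example.

```python
import numpy as np
from collections import Counter

# Rows copied from the Iris table displayed earlier (five setosa, four virginica).
X_train = np.array([
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2],
    [6.7, 3.0, 5.2, 2.3], [6.3, 2.5, 5.0, 1.9],
    [6.5, 3.0, 5.2, 2.0], [6.2, 3.4, 5.4, 2.3],
])
y_train = ["setosa"] * 5 + ["virginica"] * 4

def knn_predict(query, X, y, k=3, p=2):
    """Predict a label by majority vote among the k nearest training rows."""
    # Minkowski distance of order p from the query to every training row.
    dists = np.sum(np.abs(X - np.asarray(query, dtype=float)) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical query points, not taken from the dataset.
pred_a = knn_predict([5.0, 3.4, 1.5, 0.2], X_train, y_train, k=3, p=2)
pred_b = knn_predict([6.4, 2.9, 5.1, 2.0], X_train, y_train, k=3, p=1)
print(pred_a, pred_b)
```

The metric only enters through the line that computes `dists`, which is why changing p (or swapping in another distance) can change which neighbours are selected.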

#### Real life Applications

- Recommender Systems: Calculating similarities between users or items to make personalized recommendations.
- Image and Text Retrieval: Measuring the similarity between images or documents for efficient retrieval.
- Bioinformatics: Analyzing genetic sequences and protein structures to identify similarities and differences.
- Geographic Information Systems (GIS): Computing distances between locations for route planning and spatial analysis.
- Customer Segmentation: Grouping customers based on their purchasing behavior or demographic similarities.

#### Conclusion
For the two samples compared above (rows 0 and 20 of X):

Euclidean Distance (0.44):

The Euclidean distance is the straight-line distance between the two samples. It falls between the Manhattan and Minkowski (p=3) values, which is expected: for any fixed pair of points, the Minkowski distance does not increase as p grows, so Manhattan (p=1) ≥ Euclidean (p=2) ≥ Minkowski (p=3).
Because differences are squared before they are summed, a large difference in a single feature contributes more than it does under the Manhattan distance.

Manhattan Distance (0.70):

The Manhattan distance is the largest of the three because it sums the absolute differences of all four features directly. This ordering is a property of the metrics themselves and does not mean the samples are farther apart in any absolute sense.
Each feature contributes linearly and independently, so the Manhattan distance is less dominated by a single large difference than the Euclidean distance is.

Minkowski Distance with p=3 (0.38):

The Minkowski distance with p=3 is the lowest of the three, again as a consequence of the ordering above.
It does not sit between the Euclidean and Manhattan distances. Raising p places more weight on the largest per-feature differences, and as p grows the value approaches the Chebyshev distance, which is the largest single per-feature difference.

#### Inferences for the Iris dataset based on these distances may include:

All three values describe a single pair of samples (rows 0 and 20), so they do not characterize the dataset as a whole.
Both samples are setosa flowers, and their distance is small under every metric compared with the setosa-to-virginica entries in the pairwise matrices above (roughly 4.1 to 4.8 for Euclidean and 6.3 to 7.9 for Manhattan). This is consistent with samples of the same species lying close together in feature space.
The three metrics are on different scales, so distances should be compared within one metric and not across metrics.

- The choice of distance metric ultimately depends on the nature of the data and the problem at hand. Different metrics have different properties, and each may be better suited to certain applications.
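
The relative sizes of the three distances follow from the metrics themselves and not from the Iris data. The sketch below checks this empirically on random points; the seed, sample size and the helper name `minkowski_rows` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, used only for reproducibility
A = rng.normal(size=(1000, 4))   # 1000 random points with 4 features each
B = rng.normal(size=(1000, 4))

def minkowski_rows(A, B, p):
    """Row-wise Minkowski distance of order p between matching rows of A and B."""
    return np.sum(np.abs(A - B) ** p, axis=1) ** (1.0 / p)

d1 = minkowski_rows(A, B, 1)  # Manhattan
d2 = minkowski_rows(A, B, 2)  # Euclidean
d3 = minkowski_rows(A, B, 3)  # Minkowski with p=3

tol = 1e-12  # small tolerance for floating-point rounding
print("Manhattan >= Euclidean for every pair:", bool(np.all(d1 >= d2 - tol)))
print("Euclidean >= Minkowski (p=3) for every pair:", bool(np.all(d2 >= d3 - tol)))
```

Because the ordering holds for every pair, a smaller number under one metric than under another says nothing about the data; only comparisons within a single metric are meaningful.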

#### 2. Write a Python program to demonstrate the use of the Jaccard Distance metric on Boolean-valued vectors and any sample dataset.

- The DataFrame df1 holds 10 samples and 5 features, where each value is a randomly generated binary value (0 or 1). Its columns are named 'Feature_1' to 'Feature_5'. No random seed is set, so the values change each time the cell is run.

In [121]:
# Randomly generating a boolean value dataset
num_samples = 10
num_features = 5
# np.random.randint generates random integers from a specified low (inclusive) to high (exclusive) range.
sample_data = np.random.randint(2, size=(num_samples, num_features))
df1 = pd.DataFrame(sample_data, columns=[f'Feature_{i+1}' for i in range(num_features)])

In [122]:
df1

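|   | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 |
|---|-----------|-----------|-----------|-----------|-----------|
| 0 | 0         | 1         | 1         | 1         | 1         |
| 1 | 1         | 0         | 1         | 1         | 0         |
| 2 | 1         | 0         | 0         | 1         | 0         |
| 3 | 0         | 1         | 1         | 1         | 1         |
| 4 | 1         | 1         | 1         | 1         | 1         |
| 5 | 0         | 0         | 0         | 0         | 0         |
| 6 | 0         | 1         | 0         | 1         | 0         |
| 7 | 0         | 0         | 1         | 1         | 1         |
| 8 | 0         | 0         | 0         | 0         | 0         |
| 9 | 1         | 0         | 1         | 1         | 0         |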


#### Jaccard distances: 
- The Jaccard distance is a measure of dissimilarity between two sets. It quantifies how different two sets are by comparing their intersection to their union.
- The Jaccard distance ranges from 0 to 1, with 0 indicating complete similarity (when the sets are identical) and 1 indicating complete dissimilarity (when the sets have no elements in common).
- Jaccard distance is particularly useful for binary or categorical data where elements are either present or absent. It's commonly applied in text analysis, recommendation systems, and clustering tasks.
- Jaccard distance measures the proportion of elements that differ between two sets relative to the total number of unique elements in both sets. Because it is a ratio, it always lies between 0 and 1 regardless of how large the sets are. For binary vectors, positions where both vectors are 0 are ignored.
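
Written out, the definition above looks as follows. The set symbols A and B and the counts M11, M10 and M01 (the number of positions where the two binary vectors are 1/1, 1/0 and 0/1 respectively) are notation introduced here for illustration.

```latex
\begin{aligned}
J(A, B) &= \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert},
\qquad d_J(A, B) = 1 - J(A, B) \\
d_J &= \frac{M_{10} + M_{01}}{M_{11} + M_{10} + M_{01}}
\quad \text{(binary vectors)}
\end{aligned}
```

For the first two rows of df1 displayed above, (0, 1, 1, 1, 1) and (1, 0, 1, 1, 0), two positions are 1 in both rows and three positions differ, so the distance is 3 / (2 + 3) = 0.6.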

In [123]:
# Calculating Jaccard distances for every unordered pair of samples
jaccard_distances = []
index_pairs = []
for i in range(len(df1)):
    for j in range(i + 1, len(df1)):
        # jaccard_score returns the similarity; 1 - similarity is the distance
        jaccard_distance = round(1 - jaccard_score(df1.iloc[i], df1.iloc[j]), 2)
        index_pairs.append((i+1, j+1))  # 1-based sample numbers
        jaccard_distances.append(jaccard_distance)



- The above code iterates through all pairs of samples in the DataFrame df1, computes the Jaccard distance between each pair using scikit-learn's jaccard_score function, and stores the distances along with the corresponding index pairs in jaccard_distances and index_pairs lists, respectively.
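
The same distances can also be computed for all pairs at once. The sketch below is one possible NumPy formulation, using the ten rows of df1 as displayed above. The function name `jaccard_distance_matrix` is hypothetical, and treating two all-zero rows as having distance 0 is an assumed convention that differs from the default behaviour of jaccard_score.

```python
import numpy as np

# The ten rows of df1 as displayed above (values differ if the cell is re-run).
B = np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
])

def jaccard_distance_matrix(B):
    """Pairwise Jaccard distances between the rows of a 0/1 matrix."""
    B = np.asarray(B, dtype=int)
    intersection = B @ B.T                 # count of positions where both rows are 1
    ones = B.sum(axis=1)                   # number of 1s in each row
    union = ones[:, None] + ones[None, :] - intersection
    # Assumed convention: an empty union (two all-zero rows) gives similarity 1.
    similarity = np.divide(intersection, union,
                           out=np.ones(union.shape, dtype=float),
                           where=union > 0)
    return 1.0 - similarity

D = jaccard_distance_matrix(B)
print(np.round(D[0], 2))  # distances from sample 1 to every sample
```

The first row of `D` matches the distances for pairs (1, 2) through (1, 10) produced by the loop. The only pair that differs is the one formed by the two all-zero rows, which this sketch reports as 0 by the stated convention.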

In [126]:
jaccard_df = pd.DataFrame({'Indexes(Sample)': index_pairs, 'Jaccard Distance': jaccard_distances})

In [127]:
jaccard_df

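|   | Indexes(Sample) | Jaccard Distance |
|---|-----------------|------------------|
| 0 | (1, 2)          | 0.6              |
| 1 | (1, 3)          | 0.8              |
| 2 | (1, 4)          | 0.0              |
| 3 | (1, 5)          | 0.2              |
| 4 | (1, 6)          | 1.0              |
| 5 | (1, 7)          | 0.5              |
| 6 | (1, 8)          | 0.25             |
| 7 | (1, 9)          | 1.0              |
| 8 | (1, 10)         | 0.6              |
| 9 | (2, 3)          | 0.33             |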


#### Conclusion
- The Jaccard distances range from 0.00 to 1.00, representing the dissimilarity between pairs of samples.
- Samples with lower distances (closer to 0) indicate higher similarity, while those with higher distances (closer to 1) represent greater dissimilarity.
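- Pairs with a Jaccard distance of 1.00 indicate complete dissimilarity: the two samples share no feature with value 1. With its default settings, jaccard_score also returns 0 (a distance of 1.00) and raises a warning when both samples are all zeros, even though the two vectors are identical.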
- The dataset contains pairs with varying degrees of similarity, reflecting the diversity of the samples in the dataset.
- Interpretation of the distances should consider the context of the dataset and the specific application. 
- High distances may indicate meaningful differences or noise depending on the problem domain.