
In [None]:
# Mount Google Drive to access the datasets
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Experiment demonstrating similarity and dissimilarity measures in Python. Dataset: "Iris", read here from a CSV file on Google Drive
>Step 1: Loading the dataset

>We import pandas, read the Iris data from the CSV file, and separate the `Species` labels from the remaining columns:


In [None]:
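import pandas as pd

# Load the dataset; read_csv already returns a pandas DataFrame
data = pd.read_csv("/content/drive/MyDrive/datasets-master/Iris.csv")

# Keep the class labels separately
target = data['Species']

# Drop the labels from the feature frame.
# Note: 'Id' is still a column of `data`, so it takes part in the row-wise measures below.
data = data.drop('Species', axis=1)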

>Step 2: Cosine Similarity

>Cosine similarity is the cosine of the angle between two non-zero vectors: 1 means the vectors point in the same direction and 0 means they are orthogonal. Here we use it to compare the first two iris samples with the `cosine_similarity()` function from scikit-learn. Note that `data` still contains the `Id` column, so the value below is computed over `Id` plus the four measurements, not over the four measurements alone:


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(data.iloc[[0]], data.iloc[[1]])

print(cos_sim)

[[0.984316]]
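To make the definition concrete, here is a minimal pure-Python sketch of cosine similarity, dot(u, v) / (‖u‖ · ‖v‖). The helper name `cosine_sim` is ours, not part of scikit-learn, and the two rows are typed in by hand from the first two samples shown by `data.head()` later in this notebook. It also compares the result with and without the `Id` column.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# First two samples: Id followed by the four measurements (copied by hand)
row0 = [1, 5.1, 3.5, 1.4, 0.2]
row1 = [2, 4.9, 3.0, 1.4, 0.2]

with_id = cosine_sim(row0, row1)                # Id included, as in the cell above
features_only = cosine_sim(row0[1:], row1[1:])  # four measurements only

print(round(with_id, 6))
print(round(features_only, 6))
```

The first value should agree with the scikit-learn output above; the second is what we get if `Id` is left out.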


>Step 3: Euclidean Distance

>Euclidean distance is the straight-line distance between two points in a multi-dimensional space; the larger it is, the more dissimilar the points. Here we use it to measure the dissimilarity between the first two iris samples with the `euclidean_distances()` function from scikit-learn. As in Step 2, the rows passed in consist of the four measurements plus the `Id` column, which is still in `data`:


In [None]:
from sklearn.metrics.pairwise import euclidean_distances

euclidean_dist = euclidean_distances(data.iloc[[0]], data.iloc[[1]])

print(euclidean_dist)

[[1.13578167]]
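The same distance can be sketched by hand as the square root of the summed squared differences. The helper `euclidean` is a name chosen for this sketch, and the rows are copied by hand from the first two samples.

```python
import math

def euclidean(u, v):
    """Euclidean distance: sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# First two samples: Id followed by the four measurements (copied by hand)
row0 = [1, 5.1, 3.5, 1.4, 0.2]
row1 = [2, 4.9, 3.0, 1.4, 0.2]

dist_with_id = euclidean(row0, row1)            # same columns as the cell above
dist_features = euclidean(row0[1:], row1[1:])   # four measurements only

print(round(dist_with_id, 8))
print(round(dist_features, 8))
```

Since the two `Id` values differ by 1, that column alone contributes 1 to the squared sum, which is most of the distance reported above.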


>Step 4: Pearson's Correlation

>Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables and ranges from -1 to 1. Unlike the previous two steps, it compares two columns rather than two rows. Here we compute it between the `PetalLengthCm` and `PetalWidthCm` columns of the iris dataset with the `pearsonr()` function from SciPy:


In [None]:
from scipy.stats import pearsonr

# Calculate Pearson's correlation coefficient between 'PetalLengthCm' and 'PetalWidthCm'
corr, _ = pearsonr(data['PetalLengthCm'], data['PetalWidthCm'])

print(corr)

0.9627570970509661
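For reference, the coefficient is the covariance of the two variables divided by the product of their standard deviations. Below is a minimal sketch of that formula; `pearson_r` is a hypothetical helper, and the input lists are made-up toy values, not taken from the iris data.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data (illustrative only): y is an exact linear function of x
x = [1, 2, 3, 4]
r_up = pearson_r(x, [2, 4, 6, 8])    # perfectly increasing relationship
r_down = pearson_r(x, [8, 6, 4, 2])  # perfectly decreasing relationship

print(r_up, r_down)
```

A value near 0.96, as obtained above, therefore means petal width rises almost linearly with petal length across the dataset.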


>Step 5: Jaccard Similarity

>Jaccard similarity is the size of the intersection of two sets divided by the size of their union. Here we apply it to the species labels of the first and second samples using the `jaccard_score()` function from scikit-learn. The labels are first encoded as integers with `LabelEncoder`:


In [None]:
from sklearn.preprocessing import LabelEncoder

# Instantiate the label encoder
label_encoder = LabelEncoder()

# Fit and transform the target variable
target_numerical = label_encoder.fit_transform(target)

# Add the encoded labels back to the dataframe as a numerical 'Species' column
data['Species'] = target_numerical

# Display the updated dataframe
data.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1            5.1           3.5            1.4           0.2        0
1   2            4.9           3.0            1.4           0.2        0
2   3            4.7           3.2            1.3           0.2        0
3   4            4.6           3.1            1.5           0.2        0
4   5            5.0           3.6            1.4           0.2        0
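As a rough picture of what the encoding step does, the sketch below maps each distinct label to an integer in sorted order, which is how `LabelEncoder` assigns its codes. The function `encode_labels` is a hypothetical stand-in, and the label strings are assumed spellings rather than values read from `Iris.csv`.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

# Assumed label spellings, for illustration only
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
codes, classes = encode_labels(species)

print(codes)    # [2, 0, 1, 0]
print(classes)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

Under this scheme the alphabetically first class receives code 0, which matches the zeros shown for the first five rows.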


In [None]:
from sklearn.metrics import jaccard_score

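# Take the encoded species label of the first two rows, each wrapped in a one-element list
label1 = [data.iloc[0]['Species']]
label2 = [data.iloc[1]['Species']]

# Compute the Jaccard score. Both labels are 0 here, and jaccard_score treats 1 as the
# positive class by default, so the score is undefined and 0.0 is returned with a warning.
jaccard_sim = jaccard_score(label1, label2)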

print(jaccard_sim)

0.0


  _warn_prf(average, modifier, msg_start, len(result))
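The 0.0 above deserves a second look, because both samples carry the same label. As far as we can tell it comes from the default binary mode of `jaccard_score()`, which scores only the positive class (label 1); neither sample has that label, so the ratio is 0/0 and scikit-learn reports 0 together with a warning. A set-based sketch of the textbook definition, |A ∩ B| / |A ∪ B|, behaves differently. The helper `jaccard` and its handling of two empty sets are choices made for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)

same_label = jaccard({0}, {0})            # both samples have encoded label 0
diff_label = jaccard({0}, {1})            # two different labels
partial = jaccard({1, 2, 3}, {2, 3, 4})   # 2 shared items out of 4 distinct

print(same_label, diff_label, partial)    # 1.0 0.0 0.5
```

With the set definition, two samples that share the same single label get a similarity of 1, not 0.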


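>Step 6: Manhattan Distance

>Manhattan distance is the sum of the absolute differences between the coordinates of two vectors. We compute it for the first two samples with the `manhattan_distances()` function from scikit-learn, after dropping the `Species` column added in Step 5 (the `Id` column is still present):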


In [None]:
from sklearn.metrics.pairwise import manhattan_distances

data = data.drop('Species', axis=1)

# Compute the Manhattan distance
manhattan_distance = manhattan_distances(data.iloc[[0]], data.iloc[[1]])

print(manhattan_distance)

[[1.7]]
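Breaking the result down per column shows where the 1.7 comes from. This is a small sketch with the rows typed in by hand from the first two samples; `manhattan` is a helper name chosen here, not a library function.

```python
def manhattan(u, v):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

columns = ["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
row0 = [1, 5.1, 3.5, 1.4, 0.2]
row1 = [2, 4.9, 3.0, 1.4, 0.2]

# Contribution of each column to the total distance
for name, a, b in zip(columns, row0, row1):
    print(f"{name:>14}: {abs(a - b):.1f}")

total = manhattan(row0, row1)                  # all five columns, as in the cell above
features_only = manhattan(row0[1:], row1[1:])  # four measurements only

print(round(total, 1), round(features_only, 1))
```

The `Id` column accounts for 1.0 of the total, so the four measurements alone differ by only 0.7.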


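### Output:
>The cosine similarity between the first two rows is very high (about 0.98), and the Euclidean (about 1.14) and Manhattan (1.7) distances are small, so the two samples are close to each other. Because the `Id` column was part of `data`, all three values include the difference in `Id` as well as the four measurements. Pearson's correlation (about 0.96) is a different kind of result: it compares two columns, petal length and petal width, across the whole dataset and shows a strong positive linear relationship between them, so it says nothing about how similar the first two rows are. The Jaccard score of 0 does not mean the two samples are dissimilar. Both have the same encoded label (0), and `jaccard_score()` in its default binary mode counts only label 1 as positive, so with no positive labels the score is undefined and is reported as 0 along with the warning shown in Step 5.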
