# Euclidean Distance
Euclidean distance is the most commonly used distance measure. It is the straight-line distance between two points in a multi-dimensional space, and it is best suited to interval- and ratio-scaled data.
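
As a brief worked definition (the symbols $a$, $b$, and $n$ are notation introduced here, not taken from the notebook), the distance between two points $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ can be written as:

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

For example, with $a = (0, 0)$ and $b = (3, 4)$, this gives $d(a, b) = \sqrt{3^2 + 4^2} = \sqrt{25} = 5$.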

### Importing required Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from scipy.spatial.distance import euclidean

### Load Datasets

In [4]:
# Load datasets
adult_df = pd.read_csv("../adult/adult_trim.data", header=None) # No header
titanic_df = pd.read_csv('../titanic/titanic_trim.csv') # Has header

# Rename columns for clarity
adult_df.columns = ["age", "workclass", "fnlwgt", "education", "education_num", 
                    "marital_status", "occupation", "relationship", "race", "sex", 
                    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"]
adult_df.dropna(inplace=True)

In [5]:
adult_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,29,Local-gov,115585,Some-college,10,Never-married,Handlers-cleaners,Not-in-family,White,Male,0,0,50,United-States,<=50K
96,48,Self-emp-not-inc,191277,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,60,United-States,>50K
97,37,Private,202683,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,>50K
98,48,Private,171095,Assoc-acdm,12,Divorced,Exec-managerial,Unmarried,White,Female,0,0,40,England,<=50K


In [6]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
150,151,0,2,"Bateman, Rev. Robert James",male,51.0,0,0,S.O.P. 1166,12.5250,,S
151,152,1,1,"Pears, Mrs. Thomas (Edith Wearne)",female,22.0,1,0,113776,66.6000,C2,S
152,153,0,3,"Meo, Mr. Alfonzo",male,55.5,0,0,A.5. 11206,8.0500,,S
153,154,0,3,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5000,,S


### Select relevant columns from Adult dataset (mix of nominal and ratio-scaled)

In [7]:
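# Take an explicit copy so that later column assignments modify this DataFrame, not a slice of the original
adult_df = adult_df[["age", "workclass", "education", "education_num", "sex"]].copy()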

adult_df

Unnamed: 0,age,workclass,education,education_num,sex
0,39,State-gov,Bachelors,13,Male
1,50,Self-emp-not-inc,Bachelors,13,Male
2,38,Private,HS-grad,9,Male
3,53,Private,11th,7,Male
4,28,Private,Bachelors,13,Female
...,...,...,...,...,...
95,29,Local-gov,Some-college,10,Male
96,48,Self-emp-not-inc,Doctorate,16,Male
97,37,Private,Some-college,10,Male
98,48,Private,Assoc-acdm,12,Female


### Encode nominal attributes as integers for processing

In [8]:
label_encoders = {}
for column in adult_df.columns:
    if adult_df[column].dtype == object:
        le = LabelEncoder()
        adult_df[column] = le.fit_transform(adult_df[column])
        label_encoders[column] = le

adult_df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adult_df[column] = le.fit_transform(adult_df[column])


Unnamed: 0,age,workclass,education,education_num,sex
0,39,6,7,13,1
1,50,5,7,13,1
2,38,3,9,9,1
3,53,3,1,7,1
4,28,3,7,13,0
...,...,...,...,...,...
95,29,2,12,10,1
96,48,5,8,16,1
97,37,3,12,10,1
98,48,3,5,12,0


### Clean and preprocess Titanic dataset

In [9]:
titanic_df.dropna(inplace=True)
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
52,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


### Select relevant columns from Titanic dataset (mix of nominal and ratio-scaled)

In [10]:
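# Take an explicit copy so that later column assignments modify this DataFrame, not a slice of the original
titanic_df = titanic_df[["Age", "Sex", "Pclass", "Fare", "Embarked"]].copy()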
titanic_df

Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,female,1,71.2833,C
3,35.0,female,1,53.1,S
6,54.0,male,1,51.8625,S
10,4.0,female,3,16.7,S
11,58.0,female,1,26.55,S
21,34.0,male,2,13.0,S
23,28.0,male,1,35.5,S
27,19.0,male,1,263.0,S
52,49.0,female,1,76.7292,C
54,65.0,male,1,61.9792,C


### Encode nominal attributes as integers for processing

In [11]:
label_encoders_titanic = {}
for column in titanic_df.columns:
    if titanic_df[column].dtype == object:
        le = LabelEncoder()
        titanic_df[column] = le.fit_transform(titanic_df[column])
        label_encoders_titanic[column] = le  # store in the Titanic dictionary, not the Adult one

titanic_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  titanic_df[column] = le.fit_transform(titanic_df[column])


Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,0,1,71.2833,0
3,35.0,0,1,53.1,1
6,54.0,1,1,51.8625,1
10,4.0,0,3,16.7,1
11,58.0,0,1,26.55,1
21,34.0,1,2,13.0,1
23,28.0,1,1,35.5,1
27,19.0,1,1,263.0,1
52,49.0,0,1,76.7292,0
54,65.0,1,1,61.9792,0


### Combine the datasets into a dictionary for further processing

In [12]:

# Map a display name to each preprocessed DataFrame
datasets = {
    "Adult Dataset": adult_df,
    "Titanic Dataset": titanic_df
}

### Compute Euclidean Distance

In [13]:
def euclidean_distance(a, b):
    """Calculate the Euclidean distance between two vectors, or NaN on failure."""
    try:
        return euclidean(a, b)
    except Exception:
        # Mismatched lengths or non-numeric values: mark the pair as undefined
        return np.nan

# Function to create the Euclidean distance matrix
def calculate_euclidean_matrix(dataset):
    """Return an n x n DataFrame of pairwise Euclidean distances between rows."""
    values = dataset.to_numpy()  # convert once instead of calling .iloc in every iteration
    n = len(values)
    euclidean_matrix = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            euclidean_matrix[i, j] = euclidean_distance(values[i], values[j])

    return pd.DataFrame(euclidean_matrix)
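
The nested loop above calls `euclidean` once per ordered pair, which is fine for the small trimmed datasets used here. As a possible alternative, SciPy's `pdist` and `squareform` can build the same matrix in one call. This is only a sketch: the helper name `calculate_euclidean_matrix_vectorized` and the three-row `sample` frame (copied from the first three encoded Adult rows) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def calculate_euclidean_matrix_vectorized(dataset):
    """Pairwise Euclidean distances between rows (hypothetical helper name)."""
    values = dataset.to_numpy(dtype=float)
    # pdist returns the condensed upper triangle; squareform expands it
    # into a full symmetric matrix with zeros on the diagonal.
    condensed = pdist(values, metric="euclidean")
    return pd.DataFrame(squareform(condensed))


# First three encoded Adult rows: age, workclass, education, education_num, sex
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [6, 5, 3],
    "education": [7, 7, 9],
    "education_num": [13, 13, 9],
    "sex": [1, 1, 1],
})

matrix = calculate_euclidean_matrix_vectorized(sample)
print(matrix.round(6))
```

Unlike the loop version, this sketch does not fall back to `NaN` on bad input, so it assumes every column is already numeric.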

### Calculate Euclidean Distance

#### For Adult Dataset

In [14]:
euclidean_matrix_adult = calculate_euclidean_matrix(adult_df)
euclidean_matrix_adult

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.000000,11.045361,5.477226,16.643317,11.445523,4.898979,13.527749,13.784049,9.165151,4.242641,...,18.384776,6.928203,12.845233,10.535654,6.403124,12.247449,9.591663,6.855655,9.797959,9.695360
1,11.045361,0.000000,12.961481,9.219544,22.113344,13.564660,8.888194,4.898979,19.390719,8.246211,...,7.615773,14.422205,22.869193,20.615528,16.278821,22.000000,3.741657,14.387495,3.741657,18.973666
2,5.477226,12.961481,0.000000,17.117243,11.000000,5.291503,12.767145,14.142136,8.717798,6.000000,...,19.339080,3.464102,10.535654,8.062258,6.082763,9.591663,12.409674,3.316625,11.224972,6.324555
3,16.643317,9.219544,17.117243,0.000000,26.419690,19.672316,5.477226,8.544004,24.799194,13.892444,...,7.549834,19.672316,27.495454,24.454039,20.832667,26.589472,12.609520,19.646883,8.185353,22.649503
4,11.445523,22.113344,11.000000,26.419690,0.000000,9.539392,22.671568,24.515301,4.358899,14.035669,...,29.103264,10.723805,5.830952,4.898979,6.164414,6.082763,20.371549,10.770330,20.124612,6.403124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,12.247449,22.000000,9.591663,26.589472,6.082763,9.273618,22.158520,23.409400,5.099020,14.282857,...,28.670542,8.124038,1.732051,3.605551,7.681146,0.000000,20.542639,8.062258,20.396078,4.472136
96,9.591663,3.741657,12.409674,12.609520,20.371549,11.575837,11.958261,8.124038,17.378147,7.071068,...,10.677078,13.341664,21.377558,19.467922,14.662878,20.542639,0.000000,13.304135,5.477226,17.944358
97,6.855655,14.387495,3.316625,19.646883,10.770330,4.582576,15.297059,15.459625,7.549834,7.681146,...,20.904545,1.000000,9.055385,7.745967,6.633250,8.062258,13.304135,0.000000,13.228757,6.244998
98,9.797959,3.741657,11.224972,8.185353,20.124612,12.247449,7.141428,6.782330,17.832555,6.480741,...,9.165151,13.190906,21.283797,18.681542,14.247807,20.396078,5.477226,13.228757,0.000000,16.911535
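
To make one entry of this matrix concrete, the following sketch recomputes the distance between encoded Adult rows 0 and 1 by hand. The variable names are illustrative; the row values are copied from the encoded Adult table shown earlier.

```python
import numpy as np

# Encoded Adult rows 0 and 1: age, workclass, education, education_num, sex
row_0 = np.array([39, 6, 7, 13, 1], dtype=float)
row_1 = np.array([50, 5, 7, 13, 1], dtype=float)

squared_diffs = (row_0 - row_1) ** 2          # [121, 1, 0, 0, 0]
distance = float(np.sqrt(squared_diffs.sum()))  # sqrt(122)

print(round(distance, 6))  # 11.045361, matching entry [0, 1] of the matrix
```

Of the 122 units of squared difference, 121 come from `age` alone, so this entry is driven almost entirely by the attribute with the widest range.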


#### For Titanic Dataset

In [15]:
euclidean_matrix_titanic = calculate_euclidean_matrix(titanic_df)
euclidean_matrix_titanic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,0.0,18.456229,25.202529,64.345448,49.010898,58.44607,37.181239,192.661083,12.274275,28.575624,...,18.084535,21.327111,176.795521,58.559312,17.147898,48.856934,18.265607,16.114408,45.341231,16.701296
1,18.456229,0.0,19.0665,47.853527,35.126948,40.137389,18.96734,210.511306,27.483433,31.318368,...,27.964891,12.091733,194.736867,40.190297,30.773936,31.227158,2.236068,28.358597,27.1783,18.741665
2,25.202529,19.0665,0.0,61.166996,25.646104,43.718347,30.720212,214.018793,25.403794,14.978238,...,41.6585,7.00135,197.947393,44.43584,25.425,43.362374,17.044982,40.59974,31.242902,35.244771
3,64.345448,47.853527,61.166996,0.0,54.927429,30.26037,30.568611,246.766469,75.056678,76.007933,...,62.967016,55.678452,231.698601,28.756564,78.586546,17.911997,49.182924,65.667724,33.834007,53.084932
4,49.010898,35.126948,25.646104,54.927429,0.0,27.597147,31.322556,239.646829,50.989726,36.141779,...,62.803614,27.743513,223.575702,28.893814,50.904753,39.000912,33.865949,62.689892,21.55348,53.851671
5,58.44607,40.137389,43.718347,30.26037,27.597147,0.0,23.307724,250.451592,65.493595,57.982429,...,65.596362,41.12177,234.738164,1.802776,67.334112,20.085967,40.224495,66.965962,13.238202,54.945063
6,37.181239,18.96734,30.720212,30.568611,31.322556,23.307724,0.0,227.677952,46.290895,45.509867,...,42.369743,25.164459,212.060887,22.989128,49.215802,12.920819,19.76765,43.894077,12.786712,31.689273
7,192.661083,210.511306,214.018793,246.766469,239.646829,250.451592,227.677952,0.0,188.676472,206.219209,...,185.723269,212.849712,16.297412,250.368229,188.981832,236.718812,210.670382,183.870715,237.647323,196.425457
8,12.274275,27.483433,25.403794,75.056678,50.989726,65.493595,46.290895,188.676472,0.0,21.784455,...,28.04125,24.850218,172.614515,65.845736,5.22606,58.700842,26.539388,25.141696,52.27525,28.854821
9,28.575624,31.318368,14.978238,76.007933,36.141779,57.982429,45.509867,206.219209,21.784455,0.0,...,46.597683,20.605447,190.017592,58.806565,18.877077,58.242573,29.391158,44.46972,45.921159,43.270681


### Explanation
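Euclidean Distance Calculation: For each pair of rows, the metric takes the square root of the sum of squared differences across all attributes, which is the straight-line distance between the two points. It is suitable for interval- and ratio-scaled data.

Handling Different Data Types: Euclidean distance is not directly meaningful for nominal data. The label encoding used above maps each category to an integer in sorted order of the category names, so differences between codes in columns such as workclass, education, sex, and Embarked reflect the encoding order rather than any real dissimilarity. Ordinal attributes have the same problem unless the codes follow the attribute's natural order.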
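
The effect of the encoding choice on a nominal column can be sketched as follows. The notebook itself uses label encoding only; one-hot encoding via `pd.get_dummies` is shown here purely as a common alternative, and the three-row `toy` frame and all variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# Label encoding: categories are numbered in sorted order
# (Private=0, Self-emp-not-inc=1, State-gov=2).
codes = toy["workclass"].astype("category").cat.codes.to_numpy(dtype=float)
label_gaps = np.abs(codes[:, None] - codes[None, :])

# One-hot encoding: one 0/1 column per category.
dummies = pd.get_dummies(toy["workclass"], dtype=float).to_numpy()
diffs = dummies[:, None, :] - dummies[None, :, :]
one_hot_gaps = np.sqrt((diffs ** 2).sum(axis=2))

print(label_gaps)    # State-gov is 1 from Self-emp-not-inc but 2 from Private
print(one_hot_gaps)  # every pair of different categories is sqrt(2) apart
```

With label codes, some categories appear closer than others purely because of alphabetical order. With one-hot columns, all distinct categories are equally far apart, at the cost of adding one column per category.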

### Observation and Analysis
The resulting matrices hold the pairwise Euclidean distances between rows. Each matrix is symmetric with zeros on the diagonal, since every point is at distance 0 from itself. A smaller value indicates that two data points are closer to each other, while a larger value indicates that they are further apart.
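
One way to read such a matrix is to look up each row's nearest neighbour. The sketch below does this for the top-left 3×3 corner of the Adult matrix shown above; the variable names are illustrative.

```python
import numpy as np

# Top-left 3x3 corner of the Adult distance matrix
distances = np.array([
    [0.000000, 11.045361, 5.477226],
    [11.045361, 0.000000, 12.961481],
    [5.477226, 12.961481, 0.000000],
])

# Mask the diagonal so a row is never reported as its own nearest neighbour
masked = distances.copy()
np.fill_diagonal(masked, np.inf)

nearest = masked.argmin(axis=1)
print(nearest.tolist())  # [2, 0, 0]
```

Within this corner, rows 0 and 2 are each other's closest match, and row 1 is closest to row 0.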

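Euclidean distance is sensitive to the scale of the data, so attributes with different units or ranges usually need to be standardized or normalized before applying this metric. In the Titanic matrix, for example, the passenger with a Fare of 263.0 (row 7) lies more than 180 away from almost every other passenger shown, because Fare spans a much wider range than Age, Pclass, or the encoded columns.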
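
The scale sensitivity can be illustrated by comparing distances before and after z-score standardization. This is a sketch, not the notebook's method: the `pairwise_euclidean` helper is a hypothetical name, and the three rows use the Age and Fare values of Titanic rows 1, 3, and 27 from the table above.

```python
import numpy as np
import pandas as pd

# Age and Fare for Titanic rows 1, 3 and 27
sample = pd.DataFrame({
    "Age": [38.0, 35.0, 19.0],
    "Fare": [71.2833, 53.1, 263.0],
})


def pairwise_euclidean(frame):
    """Pairwise Euclidean distances between rows (hypothetical helper)."""
    values = frame.to_numpy(dtype=float)
    diffs = values[:, None, :] - values[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2))


# Z-score standardization: each column gets mean 0 and standard deviation 1
standardized = (sample - sample.mean()) / sample.std(ddof=0)

raw_distances = pairwise_euclidean(sample)
scaled_distances = pairwise_euclidean(standardized)

print(pd.DataFrame(raw_distances).round(3))
print(pd.DataFrame(scaled_distances).round(3))
```

In the raw distances, the Fare gap dominates every comparison with the third row. After standardization, Age and Fare contribute on comparable scales.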