# Manhattan Distance
Also known as the L1 norm or taxicab distance, this metric measures the distance between two points by summing the absolute differences of their corresponding coordinates. It’s useful for measuring distance in grid-like paths (e.g., city blocks).
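As a small worked example of the definition above, the sketch below computes the distance by hand with NumPy. The helper name `manhattan` and the two sample points are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

# Hypothetical sample points
p = [1, 2, 3]
q = [4, 0, 3]

d = manhattan(p, q)  # |1-4| + |2-0| + |3-3| = 3 + 2 + 0 = 5.0
```

The order of the arguments does not matter, since each term is an absolute difference.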

### Import required libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from scipy.spatial.distance import cityblock

### Load Datasets

In [5]:
# Load datasets
adult_df = pd.read_csv("../adult/adult_trim.data", header=None) # No header
titanic_df = pd.read_csv('../titanic/titanic_trim.csv') # Has header

# Rename columns for clarity
adult_df.columns = ["age", "workclass", "fnlwgt", "education", "education_num", 
                    "marital_status", "occupation", "relationship", "race", "sex", 
                    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"]
adult_df.dropna(inplace=True)
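One caveat about `dropna` here: the original UCI Adult file marks missing values with `?` (preceded by a space), which pandas reads as an ordinary string, so `dropna` would not remove those rows. Whether `adult_trim.data` keeps that layout is an assumption. The sketch below uses a made-up two-row sample in that layout to show how `na_values` and `skipinitialspace` could be passed to `read_csv` if it does.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the assumed UCI layout (", " separators, "?" for missing)
sample = io.StringIO(
    "39, State-gov, Bachelors\n"
    "54, ?, Some-college\n"
)

df = pd.read_csv(
    sample,
    header=None,
    names=["age", "workclass", "education"],
    na_values="?",          # treat "?" as missing
    skipinitialspace=True,  # drop the space that follows each comma
)

n_missing = int(df["workclass"].isna().sum())
cleaned = df.dropna()  # now removes the row with the missing workclass
```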

In [6]:
adult_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,29,Local-gov,115585,Some-college,10,Never-married,Handlers-cleaners,Not-in-family,White,Male,0,0,50,United-States,<=50K
96,48,Self-emp-not-inc,191277,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,60,United-States,>50K
97,37,Private,202683,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,>50K
98,48,Private,171095,Assoc-acdm,12,Divorced,Exec-managerial,Unmarried,White,Female,0,0,40,England,<=50K


In [7]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
150,151,0,2,"Bateman, Rev. Robert James",male,51.0,0,0,S.O.P. 1166,12.5250,,S
151,152,1,1,"Pears, Mrs. Thomas (Edith Wearne)",female,22.0,1,0,113776,66.6000,C2,S
152,153,0,3,"Meo, Mr. Alfonzo",male,55.5,0,0,A.5. 11206,8.0500,,S
153,154,0,3,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5000,,S


### Select relevant columns from Adult dataset (mix of nominal and ratio-scaled)

In [8]:
adult_df = adult_df[["age", "workclass", "education", "education_num", "sex"]]

adult_df

Unnamed: 0,age,workclass,education,education_num,sex
0,39,State-gov,Bachelors,13,Male
1,50,Self-emp-not-inc,Bachelors,13,Male
2,38,Private,HS-grad,9,Male
3,53,Private,11th,7,Male
4,28,Private,Bachelors,13,Female
...,...,...,...,...,...
95,29,Local-gov,Some-college,10,Male
96,48,Self-emp-not-inc,Doctorate,16,Male
97,37,Private,Some-college,10,Male
98,48,Private,Assoc-acdm,12,Female


### Encode nominal attributes as integers for processing

In [9]:
label_encoders = {}
adult_df = adult_df.copy()  # work on an independent copy so the assignments below do not raise SettingWithCopyWarning
for column in adult_df.columns:
    if adult_df[column].dtype == object:
        le = LabelEncoder()
        adult_df[column] = le.fit_transform(adult_df[column])
        label_encoders[column] = le  # keep the fitted encoder so codes can be mapped back to labels

adult_df


Unnamed: 0,age,workclass,education,education_num,sex
0,39,6,7,13,1
1,50,5,7,13,1
2,38,3,9,9,1
3,53,3,1,7,1
4,28,3,7,13,0
...,...,...,...,...,...
95,29,2,12,10,1
96,48,5,8,16,1
97,37,3,12,10,1
98,48,3,5,12,0


### Clean and preprocess Titanic dataset

In [10]:
titanic_df.dropna(inplace=True)
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
52,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


### Select relevant columns from Titanic dataset (mix of nominal and ratio-scaled)

In [11]:
titanic_df = titanic_df[["Age", "Sex", "Pclass", "Fare", "Embarked"]]
titanic_df

Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,female,1,71.2833,C
3,35.0,female,1,53.1,S
6,54.0,male,1,51.8625,S
10,4.0,female,3,16.7,S
11,58.0,female,1,26.55,S
21,34.0,male,2,13.0,S
23,28.0,male,1,35.5,S
27,19.0,male,1,263.0,S
52,49.0,female,1,76.7292,C
54,65.0,male,1,61.9792,C


### Encode nominal attributes as integers for processing

In [12]:
label_encoders_titanic = {}
titanic_df = titanic_df.copy()  # work on an independent copy so the assignments below do not raise SettingWithCopyWarning
for column in titanic_df.columns:
    if titanic_df[column].dtype == object:
        le = LabelEncoder()
        titanic_df[column] = le.fit_transform(titanic_df[column])
        label_encoders_titanic[column] = le  # store in the Titanic dictionary, not the Adult one

titanic_df


Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,0,1,71.2833,0
3,35.0,0,1,53.1,1
6,54.0,1,1,51.8625,1
10,4.0,0,3,16.7,1
11,58.0,0,1,26.55,1
21,34.0,1,2,13.0,1
23,28.0,1,1,35.5,1
27,19.0,1,1,263.0,1
52,49.0,0,1,76.7292,0
54,65.0,1,1,61.9792,0


### Combine the datasets into a dictionary for further processing

In [13]:

# Map a display name to each preprocessed DataFrame
datasets = {
    "Adult Dataset": adult_df,
    "Titanic Dataset": titanic_df
}

### Compute Manhattan Distance

In [14]:
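def manhattan_distance(a, b):
    """Return the Manhattan (L1) distance between two vectors, or NaN if it cannot be computed."""
    try:
        return cityblock(a, b)
    except Exception:
        # For example mismatched lengths or non-numeric values; NaN marks the pair as undefined
        return np.nan

def calculate_manhattan_matrix(dataset):
    """Build the n x n matrix of pairwise Manhattan distances between the rows of a DataFrame."""
    values = dataset.to_numpy()  # convert once instead of calling .iloc inside the loop
    n = len(values)
    manhattan_matrix = np.zeros((n, n))

    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric and the diagonal is zero, so each pair is computed once
            d = manhattan_distance(values[i], values[j])
            manhattan_matrix[i, j] = d
            manhattan_matrix[j, i] = d

    return pd.DataFrame(manhattan_matrix)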
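For larger datasets the explicit Python loop becomes slow. A possible alternative, sketched below, is SciPy's `cdist` with `metric="cityblock"`, which computes all pairs in one call. The function name `manhattan_matrix_vectorized` is hypothetical, and the toy frame reuses the `age` and `education_num` values of the first three Adult rows shown earlier.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def manhattan_matrix_vectorized(dataset):
    """Pairwise Manhattan distances between the rows of a numeric DataFrame."""
    values = dataset.to_numpy(dtype=float)
    return pd.DataFrame(cdist(values, values, metric="cityblock"))

# First three Adult rows, numeric columns only
toy = pd.DataFrame({"age": [39, 50, 38], "education_num": [13, 13, 9]})
matrix = manhattan_matrix_vectorized(toy)
# matrix.iloc[0, 1] == 11.0  (|39-50| + |13-13|)
# matrix.iloc[0, 2] == 5.0   (|39-38| + |13-9|)
# matrix.iloc[1, 2] == 16.0  (|50-38| + |13-9|)
```

Unlike the loop version, this sketch does not fall back to `NaN` on bad input; it assumes every column is already numeric.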

### Calculate Manhattan Distance

#### For Adult Dataset

In [15]:
manhattan_matrix_adult = calculate_manhattan_matrix(adult_df)
manhattan_matrix_adult

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.0,12.0,10.0,29.0,15.0,10.0,25.0,20.0,16.0,6.0,...,24.0,14.0,23.0,19.0,9.0,22.0,14.0,13.0,16.0,18.0
1,12.0,0.0,20.0,17.0,25.0,20.0,15.0,8.0,26.0,10.0,...,12.0,24.0,33.0,29.0,19.0,32.0,6.0,23.0,8.0,28.0
2,10.0,20.0,0.0,25.0,17.0,8.0,21.0,16.0,14.0,10.0,...,24.0,6.0,15.0,9.0,11.0,14.0,20.0,5.0,18.0,8.0
3,29.0,17.0,25.0,0.0,38.0,33.0,10.0,13.0,39.0,23.0,...,13.0,31.0,40.0,34.0,32.0,39.0,23.0,30.0,15.0,33.0
4,15.0,25.0,17.0,38.0,0.0,13.0,32.0,33.0,7.0,15.0,...,33.0,17.0,8.0,8.0,8.0,11.0,27.0,18.0,23.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,22.0,32.0,14.0,39.0,11.0,16.0,35.0,30.0,10.0,22.0,...,36.0,10.0,3.0,7.0,13.0,0.0,32.0,9.0,30.0,8.0
96,14.0,6.0,20.0,23.0,27.0,18.0,19.0,12.0,24.0,12.0,...,18.0,24.0,33.0,29.0,21.0,32.0,0.0,23.0,10.0,28.0
97,13.0,23.0,5.0,30.0,18.0,7.0,26.0,21.0,13.0,13.0,...,27.0,1.0,10.0,12.0,12.0,9.0,23.0,0.0,21.0,11.0
98,16.0,8.0,18.0,15.0,23.0,18.0,9.0,14.0,24.0,10.0,...,12.0,20.0,29.0,25.0,19.0,30.0,10.0,21.0,0.0,26.0


#### For Titanic Dataset

In [16]:
manhattan_matrix_titanic = calculate_manhattan_matrix(titanic_df)
manhattan_matrix_titanic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,0.0,22.1833,37.4208,91.5833,65.7333,65.2833,47.7833,212.7167,16.4459,37.3041,...,25.0042,30.2833,191.2375,65.7833,24.0042,65.0,21.1833,22.9167,49.7833,21.6833
1,22.1833,0.0,21.2375,69.4,49.55,43.1,25.6,226.9,38.6292,40.8792,...,39.1875,14.1,207.4208,43.6,44.1875,42.8167,3.0,39.1,30.6,26.5
2,37.4208,21.2375,0.0,88.1625,30.3125,59.8625,42.3625,246.1375,31.8667,22.1167,...,58.425,7.1375,226.6583,62.3625,25.425,61.5792,18.2375,58.3375,44.3625,47.7375
3,91.5833,69.4,88.1625,0.0,65.85,35.7,45.8,264.3,108.0292,110.2792,...,80.5875,81.3,254.8208,33.2,113.5875,26.5833,72.4,86.5,43.8,69.9
4,65.7333,49.55,30.3125,65.85,0.0,39.55,39.95,276.45,60.1792,44.4292,...,88.7375,37.45,256.9708,40.05,55.7375,39.2667,48.55,88.65,24.05,76.05
5,65.2833,43.1,59.8625,35.7,39.55,0.0,29.5,266.0,81.7292,81.9792,...,78.2875,53.0,246.5208,2.5,85.2875,30.2833,44.1,78.2,15.5,67.6
6,47.7833,25.6,42.3625,45.8,39.95,29.5,0.0,236.5,64.2292,64.4792,...,48.7875,35.5,217.0208,29.0,67.7875,19.2167,26.6,48.7,19.0,38.1
7,212.7167,226.9,246.1375,264.3,276.45,266.0,236.5,0.0,218.2708,248.0208,...,187.7125,239.0,21.4792,265.5,220.7125,237.7167,227.9,189.8,255.5,200.4
8,16.4459,38.6292,31.8667,108.0292,60.1792,81.7292,64.2292,218.2708,0.0,31.75,...,30.5583,28.7292,196.7916,82.2292,7.5583,81.4459,37.6292,28.4708,66.2292,38.1292
9,37.3041,40.8792,22.1167,110.2792,44.4292,81.9792,64.4792,248.0208,31.75,0.0,...,60.3083,28.9792,226.5416,84.4792,27.3083,83.6959,37.8792,58.2208,66.4792,49.6208


### Explanation
Manhattan Distance Calculation: This metric is the sum of the absolute differences between corresponding coordinates of two points. It’s especially useful in grid-like environments where movement is restricted to horizontal and vertical paths.

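Handling Different Data Types: Like Euclidean Distance, Manhattan Distance is most meaningful for interval- and ratio-scaled data. It can be applied to ordinal data with caution, but for nominal attributes such as workclass or Embarked, the integer codes assigned by LabelEncoder carry no order or magnitude, so the differences between them are arbitrary. Because differences are not squared, Manhattan Distance is less sensitive to outliers than Euclidean Distance.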
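The caution about data types matters for the integer-encoded nominal columns in this notebook. The sketch below is an illustration, not part of the original analysis. It uses pandas category codes, which are assigned in sorted order just as `LabelEncoder` assigns them, and compares them with one-hot encoding via `pd.get_dummies`, one possible alternative under which every pair of different categories is equally far apart.

```python
import pandas as pd

workclass = pd.Series(["Local-gov", "Private", "State-gov"])

# Integer codes in alphabetical order: Local-gov=0, Private=1, State-gov=2
codes = workclass.astype("category").cat.codes
d_codes_near = abs(int(codes[0]) - int(codes[1]))  # Local-gov vs Private   -> 1
d_codes_far = abs(int(codes[0]) - int(codes[2]))   # Local-gov vs State-gov -> 2

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(workclass).astype(int)
d_onehot_near = int((onehot.iloc[0] - onehot.iloc[1]).abs().sum())  # -> 2
d_onehot_far = int((onehot.iloc[0] - onehot.iloc[2]).abs().sum())   # -> 2
```

With integer codes, "State-gov" appears twice as far from "Local-gov" as "Private" does, purely because of alphabetical order. With one-hot columns every mismatch contributes the same amount.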

### Observation and Analysis
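The resulting matrices hold the pairwise Manhattan distances between records: entry (i, j) is the distance between the i-th and j-th rows, the diagonal is zero, and the matrix is symmetric. A smaller value means two records are more similar across the selected attributes, while a larger value means they differ more. Because the attributes are not rescaled, columns with large ranges dominate the result. In the Titanic matrix, for example, the distance of 212.7167 between rows 0 and 7 comes almost entirely from the Fare difference (71.2833 versus 263.0).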
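To see how much a single wide-range column can contribute to these distances, the sketch below takes Age and Fare for three passengers from the encoded Titanic table and compares the raw distance with the distance after min-max scaling. The scaling step is an assumption for illustration; the notebook itself computes distances on unscaled values.

```python
import pandas as pd

# Age and Fare of the passengers with index 1, 3 and 27 in the Titanic table
toy = pd.DataFrame({"Age": [38.0, 35.0, 19.0], "Fare": [71.2833, 53.1, 263.0]})

def l1(row_a, row_b):
    """Manhattan distance between two rows (pandas Series)."""
    return float((row_a - row_b).abs().sum())

# Raw distance between the first and third passenger: 19 (Age) + 191.7167 (Fare)
raw = l1(toy.iloc[0], toy.iloc[2])
fare_share = abs(toy.loc[0, "Fare"] - toy.loc[2, "Fare"]) / raw  # fraction due to Fare

# Min-max scaling maps each column to [0, 1] so that both contribute comparably
scaled = (toy - toy.min()) / (toy.max() - toy.min())
scaled_d = l1(scaled.iloc[0], scaled.iloc[2])
```

On the raw values, Fare accounts for roughly nine tenths of the distance. After scaling, Age and Fare contribute nearly equal amounts.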

Manhattan Distance is particularly useful in scenarios where the difference in individual dimensions is more meaningful than their squared differences (as in Euclidean Distance).
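The effect of squaring can be made concrete with a small numeric sketch. The per-dimension differences below are hypothetical. The code compares how much of each distance is attributable to the single largest difference.

```python
import numpy as np

# Hypothetical absolute differences in three dimensions; the last one is an outlier
diff = np.array([1.0, 1.0, 10.0])

manhattan = diff.sum()                  # 1 + 1 + 10 = 12.0
euclidean = np.sqrt((diff ** 2).sum())  # sqrt(1 + 1 + 100) = sqrt(102)

# Share of the total contributed by the largest difference
share_l1 = diff.max() / manhattan                 # 10 / 12
share_l2 = diff.max() ** 2 / (diff ** 2).sum()    # 100 / 102
```

The outlier accounts for about 83% of the Manhattan sum but about 98% of the squared Euclidean sum, which is the sense in which Manhattan Distance weights individual dimensions more evenly.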