<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Multi-Dimensional Scaling**


Estimated time needed: **45** minutes


## Use cases of Multi-Dimensional Scaling

*   Recognizing families of parts in order to design cellular manufacturing systems.
*   Creating groups of products when designing assembly areas.
*   In market research, multi-dimensional scaling is often used to plot data, such as how consumers perceive products, in an easy-to-interpret visual form.

For instance, suppose a realtor has many listings to sell. Each listing has several attributes, such as the number of bedrooms, the number of bathrooms, and the square footage. As a data scientist, you are hired by the realtor to quantify the similarities and dissimilarities between listings, so that brokers can use this information when recommending properties to buyers.

However, each listing has more attributes than can be visualized directly. To get a clearer sense of how the listings differ, you need to represent the listing data in a lower-dimensional space.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/images/realtor.jpeg" width=60%>


Multi-Dimensional Scaling (MDS) is a family of algorithms; one member, classical MDS applied to Euclidean distances, is equivalent to Principal Component Analysis (PCA). Like PCA, MDS can be used for dimensionality reduction; MDS can also be used to map complex differences between objects into a visual space. Additional articles on MDS:   <a href="https://link.springer.com/chapter/10.1007/978-3-642-82580-4_139?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01"> \[1]</a>, <a href="https://www.djsresearch.co.uk/glossary/item/Multi-Dimensional-Scaling?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01"> \[2]</a>
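
To make the link between MDS and PCA concrete, here is a minimal NumPy sketch. It assumes randomly generated data (not one of the lab's datasets) and uses classical MDS (double centering of squared Euclidean distances), which is not the SMACOF procedure that scikit-learn's `MDS` runs. Under these assumptions the two embeddings agree up to the sign of each axis.

```python
import numpy as np

# Hypothetical data: 6 objects with 4 attributes each
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Xc = X - X.mean(axis=0)  # center the columns

# PCA scores on the first two principal components, via SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]

# Classical MDS: double-center the squared Euclidean distance matrix
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# Keep the two largest eigenpairs of B
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
mds_coords = eigvecs[:, order] * np.sqrt(eigvals[order])

# The coordinates match up to a sign flip of each axis
same = np.allclose(np.abs(pca_scores), np.abs(mds_coords), atol=1e-8)
print(same)
```

The comparison uses absolute values because eigenvectors are only defined up to sign.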

There are several categories of Multidimensional Scaling (MDS). In this lab, we will review Metric MDS and Non-Metric MDS using the **scikit-learn** library. For more information on MDS, please see <a href="https://arxiv.org/pdf/2009.08136.pdf?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01"> \[3]</a>.


Steps in MDS analysis:

*   Compute or load a matrix of pairwise dissimilarities between the objects
*   Fit a Metric or Non-Metric MDS model to obtain a low-dimensional embedding
*   Check the Stress of the fitted embedding
*   Visualize and interpret the embedded points


## **Table of Contents**

<ol>
    <li><a href="https://#Objectives">Objectives</a></li>
    <li><a href="https://#Datasets">Datasets</a></li>
    <li>
        <a href="https://#Setup">Setup</a>
        <ol>
            <li><a href="https://#Installing-Required-Libraries">Installing Required Libraries</a></li>
            <li><a href="https://#Importing-Required-Libraries">Importing Required Libraries</a></li>
            <li><a href="https://#Defining-Helper-Functions">Defining Helper Functions</a></li>
        </ol>
    </li>
    <li>
        <a href="https://#Metric-MDS">Metric MDS</a>
        <ol>
            <li><a href="https://#From-Relative-Location-to-Absolute-Location">From Relative Location to Absolute Location</a></li>
            <li><a href="https://#Example-1">Example 1</a></li>
        </ol>
    </li>
    <li><a href="https://#Non-Metric-MDS">Non-Metric MDS</a></li>
    <li>
        <a href="https://#Dimensionality-reduction-with-MDS">Dimensionality reduction with MDS</a>
        <ol>
            <li><a href="https://#Exercise-1">Exercise 1 </a></li>
        </ol>    
    </li>   
    <li><a href="https://#T-Distributed-Stochastic-Neighbor-Embedding-(optional)">T-Distributed Stochastic Neighbor Embedding (optional)</a></li>
</ol>


## Objectives


After completing this lab you will be able to:


*   **Understand** different types of Multi-Dimensional Scaling
*   **Understand** concepts of Metric MDS and Non-Metric MDS, including embedding space, minimization, and Stress
*   **Apply** Metric MDS and Non-Metric MDS
*   **Apply** different distance metrics to Metric MDS and Non-Metric MDS
*   **Apply** MDS to dimensionality reduction


## Datasets

Datasets for this lab are gathered from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01) under the MIT License.


## Setup


### Installing Required Libraries

The following required modules are pre-installed in the Skills Network Labs environment. However, if you run this notebook's commands in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `!mamba` in the code cell below.

### Importing Required Libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
from scipy.spatial.distance import euclidean, cityblock, cosine
from sklearn.metrics import pairwise
from sklearn.preprocessing import  MinMaxScaler
from matplotlib import offsetbox
from sklearn.manifold import MDS

In [7]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
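
The imports above include three distance functions from `scipy.spatial.distance`: `euclidean`, `cityblock`, and `cosine`. As a rough illustration of what each one measures, here is a NumPy-only sketch. The helper names (`euclidean_dist`, `cityblock_dist`, `cosine_dist`) and the two vectors are made up for this example and are not part of the lab.

```python
import numpy as np

def euclidean_dist(u, v):
    # Straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((u - v) ** 2)))

def cityblock_dist(u, v):
    # Manhattan distance: sum of absolute coordinate differences
    return float(np.sum(np.abs(u - v)))

def cosine_dist(u, v):
    # One minus the cosine of the angle between the two vectors
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical, orthogonal vectors
u = np.array([0.0, 3.0])
v = np.array([4.0, 0.0])

print(euclidean_dist(u, v))  # 5.0
print(cityblock_dist(u, v))  # 7.0
print(cosine_dist(u, v))     # 1.0
```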

### Defining Helper Functions


This function draws a labeled scatter plot of latitude and longitude data:


In [8]:
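def plot_points(df, color="red", title=""):
    # Expects a DataFrame with 'lon' and 'lat' columns; the index holds the point labels
    X = df['lon']
    Y = df['lat']
    annotations = df.index
    plt.figure(figsize=(8, 6))
    plt.scatter(X, Y, s=100, color=color)
    plt.title(title)
    plt.xlabel("lon")
    plt.ylabel("lat")
    
    # Label each point; use positional access because the index holds names, not integers
    for i, label in enumerate(annotations):
        plt.annotate(label, (X.iloc[i], Y.iloc[i]))
    plt.axis('equal')
    plt.show()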

This function plots the digits dataset in two dimensions after dimensionality reduction, marking each point with its digit:


In [9]:
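def plot_embedding(X, title, ax):
    # Relies on the global variables `digits` (the dataset) and `y` (its labels),
    # which must be defined before this function is called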
    X = MinMaxScaler().fit_transform(X)
    for digit in digits.target_names:
        ax.scatter(
            *X[y == digit].T,
            marker=f"${digit}$",
            s=60,
            color=plt.cm.Dark2(digit),
            alpha=0.425,
            zorder=2,
        )
    shown_images = np.array([[1.0, 1.0]])  # just something big
    for i in range(X.shape[0]):
        # plot every digit on the embedding
        # show an annotation box for a group of digits
        dist = np.sum((X[i] - shown_images) ** 2, 1)
        if np.min(dist) < 4e-3:
            # don't show points that are too close
            continue
        shown_images = np.concatenate([shown_images, [X[i]]], axis=0)
        imagebox = offsetbox.AnnotationBbox(
            offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i]
        )
        imagebox.set(zorder=1)
        ax.add_artist(imagebox)

    ax.set_title(title)
    ax.axis("off")

## Metric MDS


Metric MDS represents the objects as points in an embedding space $\boldsymbol{Z}$ while preserving the distance $d_{i,j}$ between the $i$-th and $j$-th objects. The pairwise distances are collected in a matrix:


$$
\begin{pmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,N} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,N} \\
\vdots & \vdots & & \vdots \\
d_{N,1} & d_{N,2} & \cdots & d_{N,N}
\end{pmatrix}.
$$
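
As a small illustration of such a matrix, the sketch below builds the pairwise Euclidean distances for three made-up 2-D points using NumPy broadcasting. The points are hypothetical and chosen so that the distances are round numbers.

```python
import numpy as np

# Three hypothetical objects described by two attributes each
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# diff[i, j] holds the vector x_i - x_j
diff = X[:, None, :] - X[None, :, :]

# D[i, j] is the Euclidean distance d_{i,j}
D = np.sqrt((diff ** 2).sum(axis=-1))
print(D)
```

The result is symmetric with zeros on the diagonal, since $d_{i,j} = d_{j,i}$ and $d_{i,i} = 0$.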


For the distance $d_{i,j}$ between objects $x_i$ and $x_j$, we find corresponding points $z_i$ and $z_j$ that minimize a cost function called “**Stress**”, which is the square root of a residual sum of squares:


$\text{Stress}_D(z_1,z_2,...,z_N)=\Biggl(\sum_{i\ne j=1,...,N}\bigl(d_{i,j}-\|z_i-z_j\|\bigr)^2\Biggr)^{1/2}$


The goal is to find embeddings $z_1,...,z_N$ whose pairwise Euclidean distances are as close as possible to the original $d_{i,j}$. We will experiment with several distance metrics for $d_{i,j}$. The embedding space can use any distance metric $d(z_i,z_j)$; we will focus on the Euclidean distance $d(z_i,z_j)= \|z_i-z_j\|$. For more details, see <a href="https://arxiv.org/pdf/2009.08136.pdf?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01"> \[3]</a>.
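
The Stress formula can be evaluated directly for a candidate embedding. The following is a minimal sketch under stated assumptions: the function names `pairwise_euclidean` and `stress` are our own, the toy distance matrix is made up, and the sum runs over all ordered pairs $i \ne j$ as written above (so each pair is counted twice). It is not necessarily the value that scikit-learn reports.

```python
import numpy as np

def pairwise_euclidean(Z):
    # Matrix of Euclidean distances between all rows of Z
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Z):
    # Square root of the summed squared residuals over all i != j
    residual = D - pairwise_euclidean(Z)
    off_diag = ~np.eye(D.shape[0], dtype=bool)
    return float(np.sqrt(np.sum(residual[off_diag] ** 2)))

# Hypothetical distances between three objects lying on a line
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

Z_good = np.array([[0.0], [1.0], [2.0]])  # reproduces D exactly
Z_bad = np.array([[0.0], [0.0], [0.0]])   # collapses all points together

print(stress(D, Z_good))  # 0.0
print(stress(D, Z_bad))
```

An embedding that reproduces every distance has zero Stress; the collapsed embedding leaves all distances unexplained and therefore scores worse.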


### From Relative Location to Absolute Location


To better understand how MDS works, we would like to find the positions of several cities given only their relative differences in latitude and longitude. Latitude is an angle that specifies the north–south position of a point on the Earth; it ranges from 0° at the Equator to 90° (North or South) at the poles. Lines of constant latitude, or parallels, run east–west as circles parallel to the Equator. Longitude, usually denoted by the Greek letter lambda (λ), specifies the east–west position of a point on the Earth's surface. Meridians (lines running from pole to pole) connect points with the same longitude, as shown here:


<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/images/lat-log.png" width="300" alt="loglat"  />
</center>


Latitude and longitude are determined by measuring the angles of celestial bodies, sometimes in combination with time. The difference in angle between two locations is relatively easy to calculate. This is shown in the figure below, where the boat and the city observe the sun at different angles:


<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/images/Screen_Shot_2022-03-01_at_6.26.10_PM.png"  width="600" />
</center>
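
Pairwise angle differences capture only relative position. The sketch below uses made-up angles (not the lab's data) to show that shifting or mirroring every location leaves the table of pairwise differences unchanged, which is why an embedding recovered from such a table is determined only up to shifts and reflections.

```python
import numpy as np

# Hypothetical angles (in degrees) for three locations
angles = np.array([-35.0, 49.0, -38.0])

def pairwise_abs_diff(a):
    # Absolute difference between every pair of angles
    return np.abs(a[:, None] - a[None, :])

D = pairwise_abs_diff(angles)
D_shifted = pairwise_abs_diff(angles + 10.0)  # move every location by 10 degrees
D_flipped = pairwise_abs_diff(-angles)        # mirror every location

print(np.allclose(D, D_shifted))  # True
print(np.allclose(D, D_flipped))  # True
```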


In [10]:
distance = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/distance.csv').set_index('name')
distance.head(8)


| name | Buenos Aires | Paris | Melbourne | St Petersbourg | Abidjan | Montreal | Nairobi | Salvador |
|---|---|---|---|---|---|---|---|---|
| Buenos Aires | 0.0 | 83.0 | 4.0 | 93.93 | 39.33 | 79.52 | 32.71 | 21.03 |
| Paris | 83.0 | 0.0 | 87.0 | 10.93 | 43.67 | 3.48 | 50.29 | 61.97 |
| Melbourne | 4.0 | 87.0 | 0.0 | 97.93 | 43.33 | 83.52 | 36.71 | 25.03 |
| St Petersbourg | 93.93 | 10.93 | 97.93 | 0.0 | 54.6 | 14.41 | 61.22 | 72.9 |
| Abidjan | 39.33 | 43.67 | 43.33 | 54.6 | 0.0 | 40.19 | 6.62 | 18.3 |
| Montreal | 79.52 | 3.48 | 83.52 | 14.41 | 40.19 | 0.0 | 46.81 | 58.49 |
| Nairobi | 32.71 | 50.29 | 36.71 | 61.22 | 6.62 | 46.81 | 0.0 | 11.68 |
| Salvador | 21.03 | 61.97 | 25.03 | 72.9 | 18.3 | 58.49 | 11.68 | 0.0 |
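
Before passing a table like this to MDS, it can help to confirm that it behaves like a distance matrix. The sketch below copies the entries for the first three cities from the output above and checks symmetry and the zero diagonal; the variable names are our own.

```python
import numpy as np

# Entries for Buenos Aires, Paris and Melbourne, copied from the table above
cities = ["Buenos Aires", "Paris", "Melbourne"]
D = np.array([[0.0, 83.0, 4.0],
              [83.0, 0.0, 87.0],
              [4.0, 87.0, 0.0]])

is_symmetric = np.allclose(D, D.T)              # d_ij equals d_ji
has_zero_diagonal = np.allclose(np.diag(D), 0)  # each city is at distance 0 from itself

# The Paris-Melbourne entry equals Paris-Buenos Aires plus Buenos Aires-Melbourne
is_additive = np.isclose(D[1, 2], D[1, 0] + D[0, 2])

print(is_symmetric, has_zero_diagonal, is_additive)
```

For these three cities the distances add up exactly, which is what we would expect if the values were differences along a single axis.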


For Multidimensional Scaling in `sklearn`, we use the `MDS` class, imported earlier from the `manifold` module.


We create an MDS object `embedding` with the following parameters:


`n_components`: Number of dimensions in which to immerse the dissimilarities, default=2

`dissimilarity='precomputed'`: Pre-computed dissimilarities are passed directly to **fit** and **fit_transform**

`max_iter` : Maximum number of iterations of the <a href='https://scikit-learn.org/stable/modules/generated/sklearn.manifold.smacof.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01'> SMACOF</a> algorithm for a single run, default = 300

`eps`: Relative tolerance with respect to stress at which to declare convergence, default=1e-3
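
The parameters above can be combined as in the following sketch. It assumes a scikit-learn version in which pre-computed input is requested with `dissimilarity="precomputed"` (parameter names may differ across versions), uses only the first four cities from the distance table, and fixes `random_state` so that runs are repeatable. The variable names `D_small` and `Z` are our own.

```python
import numpy as np
from sklearn.manifold import MDS

# Distances between the first four cities, copied from the table above
cities = ["Buenos Aires", "Paris", "Melbourne", "St Petersbourg"]
D_small = np.array([
    [0.0, 83.0, 4.0, 93.93],
    [83.0, 0.0, 87.0, 10.93],
    [4.0, 87.0, 0.0, 97.93],
    [93.93, 10.93, 97.93, 0.0],
])

embedding = MDS(
    n_components=2,               # embed into two dimensions
    dissimilarity="precomputed",  # D_small already holds the dissimilarities
    max_iter=300,                 # maximum SMACOF iterations per run
    eps=1e-3,                     # convergence tolerance on the stress
    random_state=0,               # assumption: fixed seed for repeatable output
)

# One row of 2-D coordinates per city
Z = embedding.fit_transform(D_small)
print(Z.shape)            # (4, 2)
print(embedding.stress_)  # final stress value reported by scikit-learn
```

Because the result depends on a random initialization, the coordinates can come out rotated or reflected from run to run unless the seed is fixed.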
