# Embedding Analysis

Return to the [index](https://github.com/Nkluge-correa/worldwide_AI-ethics).

Use this notebook to create the 3D scatter plot of the principles in the WAIE dataset.

> Note: The embedding vectors were generated via OpenAI's [`text-embedding-ada-002`](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) API.

In [6]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd
import numpy as np
import textwrap

df = pd.read_parquet('data/embeddings_dataset.parquet')

# Number of components to keep during PCA
pca_n_components = 500

# Perplexity and number of iterations for t-SNE
perplexity = 50
number_of_iterations = 1_000

# Learning rate for t-SNE
learning_rate=10

# Color to use for the plot
color = 'principles' # 'principles', 'document_regulation', 'institution_type', 'world_region', 'country'

# Stack the `df.embeddings` column into a 2D NumPy array in which each row is one embedding vector.
X = np.vstack(df.embeddings.values)

# Perform PCA on the embeddings
pca = PCA(n_components=pca_n_components)
pca_result = pca.fit_transform(X)

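# Perform t-SNE on the PCA-reduced embeddings.
# Note: scikit-learn 1.5 renamed `n_iter` to `max_iter`; switch the keyword on newer releases.
tsne = TSNE(n_components=3,
    verbose=1,
    perplexity=perplexity,
    n_iter=number_of_iterations,
    learning_rate=learning_rate
)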
tsne_results = tsne.fit_transform(pca_result)

# Create a new dataframe with the t-SNE results
tsne_df = pd.DataFrame(tsne_results, columns=['tsne_1', 'tsne_2', 'tsne_3'] )
tsne_df = pd.concat([tsne_df, df], axis=1)
tsne_df.columns= ['tsne_1', 'tsne_2', 'tsne_3'] + df.columns.tolist()

# Plot the t-SNE results as a 3D scatter plot
fig = px.scatter_3d(
    tsne_df, x='tsne_1', y='tsne_2', z='tsne_3', color=color,
    labels={color: "<b>" + color.replace('_', ' ').title() + "</b>"},
    hover_data={'tsne_1': False, 
                'tsne_2':False,
                'tsne_3':False,
                color:False,
                '<b>Document ID </b>': [" <i>" + textwrap.fill(x, width=80).replace('\n', '<br>') \
                    + "</i>" for x in tsne_df['documents_ids']],
                '<b>Year of Publication </b>': " " + tsne_df['year_of_publication'],
                '<b>Regulation Type </b>': " " + tsne_df['document_regulation'],
                '<b>Institution Type </b>': " " + tsne_df['institution_type'],
                '<b>World Region </b>': " " + tsne_df['world_region'],
                '<b>Country </b>': " " + tsne_df['country'],
                '<b>Principle </b>': " " + tsne_df['principles'],
                '<b>Description </b>': [" <i>" + textwrap.fill(x, width=80).replace('\n', '<br>') \
                    + "</i>" for x in tsne_df['text']] , 
            }
)

fig.update_layout(template='ggplot2',
                  title=f'<b>t-SNE with {pca_n_components if X.shape[0] > pca_n_components else int(X.shape[0]/2)} \
components ranked by PCA<br>Total Explained Variance: {pca.explained_variance_ratio_.sum() * 100:.2f}%</b>',
                  title_x=0.5,
                  scene=dict(
                    xaxis=dict(showticklabels=False, visible=False),
                    yaxis=dict(showticklabels=False, visible=False),
                    zaxis=dict(showticklabels=False, visible=False),
                  ))
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)

fig.add_annotation(dict(font=dict(color='black',size=12),
                                        x=0.05,
                                        y=-0.1,
                                        showarrow=False,
                                        text="<b><i>Embeddings were obtained<br>via the OpenAI API using<br>text-embedding-ada-002.</i></b>",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper",
                                        bordercolor='black',
                                        borderwidth=1,
                                        bgcolor="white"))

fig.show()
#fig.write_html("data/tsne.html")

[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 1456 samples in 0.000s...
[t-SNE] Computed neighbors for 1456 samples in 0.116s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1456
[t-SNE] Computed conditional probabilities for sample 1456 / 1456
[t-SNE] Mean sigma: 0.153985
[t-SNE] KL divergence after 50 iterations with early exaggeration: 66.817604
[t-SNE] KL divergence after 950 iterations: 1.328897
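
The log above can be cross-checked against the hyperparameters. The following is a minimal sketch, not part of the notebook: it assumes scikit-learn's default Barnes-Hut t-SNE, which searches about `3 * perplexity + 1` neighbours per point, and it assumes the fallback implied by the plot title (half the sample count when there are not more samples than requested PCA components). The helper names are hypothetical, and the feature count of 1536 is the output dimension of `text-embedding-ada-002`.

```python
def effective_pca_components(requested, n_samples, n_features):
    """Component count following the plot title's rule, capped at what PCA can return."""
    # Assumed fallback: use half the sample count when the request is too large.
    chosen = requested if n_samples > requested else n_samples // 2
    # PCA cannot return more components than min(n_samples, n_features).
    return min(chosen, n_samples, n_features)


def tsne_neighbor_count(perplexity, n_samples):
    """Neighbours searched per point by Barnes-Hut t-SNE (assumed rule)."""
    return min(n_samples - 1, int(3.0 * perplexity + 1))


# Values taken from the cell and its log: 1456 samples, perplexity 50, 500 components.
n_pca = effective_pca_components(500, n_samples=1456, n_features=1536)
n_neighbors = tsne_neighbor_count(50, n_samples=1456)
print(n_pca, n_neighbors)
```

With the settings used here the sketch yields 500 components and 151 neighbours, which matches the `Computing 151 nearest neighbors` line in the log.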


Below, we build a distance matrix from the 3D t-SNE coordinates, using Euclidean distance as the metric. Every principle is compared with every other principle (the mean pairwise distance between the points of the two groups) and with itself (the mean pairwise distance among its own points, zero self-distances included).

In [12]:
from sklearn.metrics.pairwise import euclidean_distances

# Create a pandas DataFrame to store the distance matrix (float dtype so the heatmap receives numeric values)
distance_matrix = pd.DataFrame(index=tsne_df.principles.unique(), columns=tsne_df.principles.unique(), dtype=float)

# Loop through each principle and calculate the mean pairwise distance among its own points
for principle in tsne_df.principles.unique():
    temp_df = tsne_df[tsne_df.principles == principle]

    avg_distance = euclidean_distances(temp_df[['tsne_1', 'tsne_2', 'tsne_3']],
                                      temp_df[['tsne_1', 'tsne_2', 'tsne_3']]).mean()

    distance_matrix.loc[principle, principle] = avg_distance

    principles = tsne_df.principles.unique().tolist()
    principles.remove(principle)

    # Now, calculate the mean pairwise distance between this principle's points and each other principle's points
    for p in principles:
        temp_df_2 = tsne_df[tsne_df.principles == p]

        distance = euclidean_distances(temp_df[['tsne_1', 'tsne_2', 'tsne_3']],
                                      temp_df_2[['tsne_1', 'tsne_2', 'tsne_3']]).mean()

        distance_matrix.loc[principle, p] = distance
        distance_matrix.loc[p, principle] = distance


# Plot the distance matrix as a heatmap
fig = px.imshow(distance_matrix.values,
                x=distance_matrix.columns,
                y=distance_matrix.columns,
                text_auto=True,
                color_continuous_scale='viridis',
                labels=dict(x="Principle", 
                            y="Principle", 
                            color="Distance"))

fig.update_xaxes(side='bottom', tickangle=-45, tickfont=dict(size=10))
fig.update_layout(template='ggplot2',
                  coloraxis_showscale=False)

fig.add_annotation(dict(font=dict(color='black',size=12),
                                        x=0.75,
                                        y=0.9,
                                        showarrow=False,
                                        text="<b><i>This distance matrix was<br>calculated using the Euclidean distance.<br>Each cell shows the mean pairwise<br>distance between the points of<br>two embedding groups.</i></b>",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper",
                                        bordercolor='black',
                                        borderwidth=1,
                                        bgcolor="white"))

fig.add_annotation(dict(font=dict(color='black',size=12),
                                        x=0.05,
                                        y=-0.1,
                                        showarrow=False,
                                        text="<b><i>Embeddings were obtained<br>via the OpenAI API using<br>text-embedding-ada-002.</i></b>",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper",
                                        bordercolor='black',
                                        borderwidth=1,
                                        bgcolor="white"))


fig.show()
#fig.write_html("data/distance-matrix.html")
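
The loop above averages `euclidean_distances` over every pair of points, which is not the same quantity as the distance between two cluster centroids. The following is a minimal sketch of the difference on made-up 3D coordinates, using only the standard library; the helper names are hypothetical and not part of the notebook.

```python
from itertools import product
from math import dist
from statistics import fmean


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over every pair (p, q) with p in a and q in b."""
    return fmean(dist(p, q) for p, q in product(a, b))


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return tuple(fmean(axis) for axis in zip(*points))


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two groups."""
    return dist(centroid(a), centroid(b))


# Toy groups (illustrative values only): group_b sits exactly on group_a's centroid.
group_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
group_b = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

pairwise_ab = mean_pairwise_distance(group_a, group_b)  # every pair is 1.0 apart
centroid_ab = centroid_distance(group_a, group_b)       # both centroids are (1, 0, 0)
within_a = mean_pairwise_distance(group_a, group_a)     # includes zero self-distances
print(pairwise_ab, centroid_ab, within_a)
```

In this toy case the centroid distance is 0.0 while the mean pairwise distance is 1.0, so two groups can share a centre and still look "far apart" in the heatmap when their points are widely spread. The diagonal entries behave similarly: they measure spread within a group, not distance to its centroid.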

All of these plots are presented on the [WAIE website](https://nkluge-correa.github.io/worldwide_AI-ethics/).

---

Return to the [index](https://github.com/Nkluge-correa/worldwide_AI-ethics).