# Visualise the sounds of letters spoken by 30 speakers

The isolet1 dataset contains summarised audio data, collected 
from 30 speakers, each of whom speaks each letter of the alphabet twice.
The data summarises the raw audio waveforms as 617 real-valued
numbers.

The data also indicates what letter was spoken by each participant.

The data is intended for building classification models, but
here we are using it to become familiar with dimensionality
reduction techniques (PCA, t-SNE and UMAP).


## Preparing the data for visualisation

First import the necessary library modules.

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Might need to install umap first using conda install conda-forge::umap-learn
# In Feb 2025 I discovered a version incompatibility...
# umap uses an older form of numpy constants.
# Unfortunately, since numpy 2.0, many older ways of referring to constants have been removed.
# The developers of umap have not updated their code to use the new way.
# I found that creating a python 3.11 environment in conda, and installing numpy and other packages there seemed to fix the problem.
# That is, umap in a python 3.11 environment was able to work OK.
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

import os
# Create the output folder for the saved plots if it does not already exist
os.makedirs('res', exist_ok=True)

In [None]:
# Declare a function that can be used to apply the reducer and return the reduced dataframe

In [None]:
def applyReducer(reducer, df, colPrefix, n_components):
  nd_reduced = reducer.fit_transform(df)
  # Now derive df_reduced from the nd_reduced array, assigning suitable column names, using the default range index.
  columns = [f'{colPrefix}{i}' for i in range(0,n_components)]
  df_reduced = pd.DataFrame(data=nd_reduced, columns=columns)
  display(df_reduced.head())
  return df_reduced, reducer

In [None]:
# Declare a function that can be used to visualise the reduced dimensions of the data

In [None]:
def visualiseReducedDim(df2, grouping, reducerType):
  fig = plt.figure()
  ax = fig.add_subplot(111)
  col = df2.columns
  sns.scatterplot(
    x=col[0], y=col[1],
    hue=grouping,
    palette=sns.color_palette(palette="hls", n_colors=26),
    data=df2,
    legend="full",
    ax=ax
  )
  ax.set_title(f'{reducerType} visualisation')
  ax.set_xlabel(col[0])
  ax.set_ylabel(col[1])
  fig.savefig(f'res/{reducerType}.pdf')

## Reading the data

Read the data, check its size and view the first 10 rows.

In [None]:
df = pd.read_csv('data/isolet1.csv', sep=',', header=None)
m,n = df.shape
print(f'Number of rows is {m} and number of columns is {n}\n')
display(df.head(10))

Check that the first columns are the audio data and the last
column is the index of each letter: 1 -> 'a', 2 -> 'b', etc.

In [None]:
# Count unique values in each column using nunique()
p = df.nunique()
print(f"Number of unique values in each column:\n{p}\n")

The first row in the file contained data, not column names, so the
dataframe column names are just integers starting at 0.
We are therefore going to generate new field names that start with 'f'
and are followed by the index, left-padded with zeros.

In [None]:
# Derive the column names, so that all but the last column has
# a generated name and the last column is named 'letter'
# See https://stackoverflow.com/a/339013/1988855
colNames = [f'f{i:03}' for i in range(1, n)]
colNames.append('letter')
#print(colNames)
df.columns = colNames
print(df['letter'].unique())

For the last column, map the letterIds (1..26) to lower case letters
('a'..'z') and check that this mapping worked OK.

In [None]:
# Note that we use mapping from a dictionary to do this efficiently.
numLetters = 26
letterIds = list(range(1,numLetters+1))
# See https://www.geeksforgeeks.org/alphabet-range-in-python/
lc_letters = [chr(i) for i in range(ord('a'), ord('z') + 1)]
# See https://stackoverflow.com/a/209854/1988855
toLcLetters = dict(zip(letterIds, lc_letters))
# See https://stackoverflow.com/a/41678874/1988855
df['letter'] = df['letter'].map(toLcLetters)
display(df.head(10))

## Reducing the dimensions before starting the visualisations

We could use k-means partitional clustering on the (numeric-valued)
audio data to derive 26 clusters. Hopefully each of these 26 clusters
would be mostly made up of a single letter.
However, this notebook focuses on visualising data subsets, so we will
skip the clustering in this case and just use the assigned letters
as labels for each set, and we will see how the subsets appear in visualisations
obtained by
1. Linear PCA to 2 dimensions
2. t-SNE to 2 dimensions
3. UMAP to 2 dimensions
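
Although we skip the clustering here, the k-means idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_blobs` as a stand-in for the audio features, three made-up letter labels instead of 26, and the names `contingency` and `purity` are introduced purely for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the audio features (assumption: 3 groups, not 26 letters)
X, y = make_blobs(n_samples=300, centers=3, n_features=10, random_state=42)
labels = pd.Series(y).map({0: 'a', 1: 'b', 2: 'c'})

# Partition the rows into as many clusters as there are distinct labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Cross-tabulate cluster ids against the known labels
contingency = pd.crosstab(clusters, labels, rownames=['cluster'], colnames=['letter'])

# Purity: fraction of rows that carry the majority label of their cluster
purity = contingency.max(axis=1).sum() / len(labels)
print(contingency)
print(f'Cluster purity: {purity:.2f}')
```

A purity close to 1 would mean that each cluster is mostly made up of a single letter. With the real data, the same check could be run with `n_clusters=26` on the audio columns of `df`.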

To reduce the computational requirements, especially for t-SNE, we will first
apply linear PCA to reduce the dimensionality of the data from 617 columns.
We will then use the resulting reduced data as input to each of the dimensionality reduction
techniques, so that their results are comparable.
In this first pass of PCA, we keep just enough components to capture at least `keepRatio`
(set to 0.8 in the next cell) of the variance and drop the remaining components, reducing
the dimensions from 617 to 32.
Since the data was already scaled, we do not need to scale it again, as would normally
be needed before applying PCA.

In [None]:
# Let the solver choose the number of components to keep, so that they capture
# at least keepRatio of the original variance
keepRatio = 0.8
pca1=PCA(n_components=keepRatio, svd_solver='full')
# See https://www.geeksforgeeks.org/select-all-columns-except-one-given-column-in-a-pandas-dataframe/
# Note that drop() does not affect the original dataframe unless we add inplace=True
pca1.fit(df.drop('letter', axis=1))
nd_PCA1 = pca1.transform(df.drop('letter', axis=1))
# Get the actual number of components directly from the object
n_components_PCA1 = pca1.n_components_
# Now derive df_PCA1 from the nd_PCA1 array, assigning suitable column names, using the default range index.
columns_PCA1 = [f'pc{i}' for i in range(0,n_components_PCA1)]
df_PCA1 = pd.DataFrame(data=nd_PCA1, columns=columns_PCA1)
print(f'Number of rows is {m} and number of columns after {keepRatio} of variance is kept is {n_components_PCA1}\n')
display(df_PCA1.head(10))
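
To see how a fractional `n_components` translates into a component count, the following sketch recomputes the count from the cumulative explained variance ratio and compares it with what `PCA` chooses. It runs on synthetic data (an assumption, so the numbers will differ from the isolet1 result), and the names `keep_ratio`, `cumulative` and `n_needed` are introduced only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 20 columns whose spread shrinks from column to column
X = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.5, 20)

keep_ratio = 0.8

# Fit all components, then accumulate the explained variance ratios
pca_full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds keep_ratio
n_needed = int(np.searchsorted(cumulative, keep_ratio, side='right')) + 1

# Let PCA pick the number of components from the same ratio
pca_auto = PCA(n_components=keep_ratio, svd_solver='full').fit(X)
print(f'Computed by hand: {n_needed}, chosen by PCA: {pca_auto.n_components_}')
```

Plotting `cumulative` against the component index is a common way to judge whether a different ratio would be a better trade-off.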

## Normalise the scaling of df_PCA1

Dimensionality reduction is affected by scaling, so it is advisable to
normalise the scaling of each column of the input dataframe df_PCA1.

In [None]:
scaler = MinMaxScaler()
nd_PCA1 = scaler.fit_transform(df_PCA1.to_numpy())
df_PCA1 = pd.DataFrame(data=nd_PCA1, columns=columns_PCA1)
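
The effect of min-max scaling can be illustrated on a tiny made-up array (an assumption for demonstration only, not taken from the dataset). Each column is mapped to the range 0 to 1 using (x - min) / (max - min), so a column with a large numeric range no longer dominates distances between rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns on very different scales (made-up values)
X = np.array([[0.0, 0.0],
              [1.0, 1000.0],
              [2.0, 500.0]])

X_scaled = MinMaxScaler().fit_transform(X)

# Distance between the first two rows, before and after scaling
d_raw = np.linalg.norm(X[0] - X[1])
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(X_scaled)
print(f'Distance before scaling: {d_raw:.3f}, after scaling: {d_scaled:.3f}')
```

Before scaling, the distance is almost entirely determined by the second column. After scaling, both columns contribute on a comparable footing.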

## Use PCA to further reduce to n_components = 2 dimensions, yielding df_PCA2

In [None]:
n_components = 2
reducer = PCA(n_components=n_components)
colPrefix = 'pc'
df_PCA2, reducer = applyReducer(reducer, df_PCA1, colPrefix, n_components)
explainedVarianceRatio = reducer.explained_variance_ratio_
print(f'Number of rows is {m} and number of columns is {n_components} with explained variance ratio {explainedVarianceRatio}\n')

## Visualise the 2-component PCA fit: df_PCA2

In [None]:
visualiseReducedDim(df_PCA2, df['letter'], 'PCA')

## Use t-SNE to further reduce df_PCA1 to n_components = 2 dimensions, yielding df_TSNE

In [None]:
perplexity=30
print(f'Use t-SNE with perplexity {perplexity} to reduce the dimensions nonlinearly to {n_components}\n')
reducer = TSNE(n_components=n_components, learning_rate='auto', init='random', random_state = 42, perplexity=perplexity, verbose=1)
colPrefix = 'tsne'
df_TSNE, reducer = applyReducer(reducer, df_PCA1, colPrefix, n_components)

## Visualise the 2-component t-SNE fit: df_TSNE

In [None]:
visualiseReducedDim(df_TSNE, df['letter'], 'TSNE')

## Use UMAP to further reduce df_PCA1 to n_components = 2 dimensions, yielding df_UMAP

In [None]:
print(f'Use UMAP to reduce the dimensions nonlinearly to {n_components}\n')
reducer = umap.UMAP(n_components=n_components, random_state=42)
colPrefix = 'umap'
df_UMAP, reducer = applyReducer(reducer, df_PCA1, colPrefix, n_components)

## Visualise the 2-component UMAP fit: df_UMAP

In [None]:
visualiseReducedDim(df_UMAP, df['letter'], 'UMAP')

## Exercises

1. Investigate the choice of the TSNE hyperparameters `perplexity`, `early_exaggeration` and `learning_rate`
   and inspect the resulting plots to choose the best combination of these parameter settings.
2. Investigate the choice of the UMAP hyperparameters `n_neighbors` and `min_dist` and inspect
   the resulting plots to choose the best combination of these parameter settings.