<a href="https://colab.research.google.com/github/HanSong19/PALS0039-Introduction-to-Deep-Learning-for-Speech-and-Language-Processing-/blob/main/PALS0039_Ex_2_3_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.3 Classification task

In this exercise we train a model to classify vowels from their [formant frequencies](https://en.wikipedia.org/wiki/Formant).

The following code reads in and summarises a data set of vowel formant frequencies.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/exercise_02/vowels.csv")

print(df)
print("----------------------------------------------------")
print(df.describe())

(a) Create three boxplots that show the differences in F1, F2, and HEIGHT between male and female samples.

Hint: You could use [`plt.subplots`](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html) to position all the plots in a row or column. You could use the [`boxplot` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html) of `DataFrame` to create each plot.

In [None]:
#(a)
f1_male = df.query('SEX =="male"')["F1"]                  # .query works like a database query: it selects the subset of rows matching the condition
f1_female = df.query('SEX =="female"')["F1"]
f1 = pd.concat([f1_male, f1_female], axis=1)
f1.columns = ["male", "female"]

f2_male = df.query('SEX =="male"')["F2"]
f2_female = df.query('SEX =="female"')["F2"]
f2 = pd.concat([f2_male, f2_female], axis=1)
f2.columns = ["male", "female"]

height_male = df.query('SEX =="male"')["HEIGHT"]
height_female = df.query('SEX =="female"')["HEIGHT"]
height = pd.concat([height_male, height_female], axis=1)
height.columns = ["male", "female"]

fig, axs = plt.subplots(1, 3, figsize=(16, 5))
axs[0].set_ylabel("F1 (Hz)")
f1.boxplot(ax=axs[0])
axs[1].set_ylabel("F2 (Hz)")
f2.boxplot(ax=axs[1])
axs[2].set_ylabel("HEIGHT (cm)")
height.boxplot(ax=axs[2])
plt.show()
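
Each box in a boxplot is drawn from the quartiles of its group, so the same comparison can also be read off numerically. The following is a minimal sketch on an invented toy table (the values and the name `toy` are assumptions for illustration, not the course data):

```python
import pandas as pd

# Toy stand-in for the vowel data; all values are invented for illustration.
toy = pd.DataFrame({
    "SEX": ["male", "male", "male", "female", "female", "female"],
    "F1": [300.0, 400.0, 500.0, 350.0, 450.0, 550.0],
    "HEIGHT": [175.0, 180.0, 185.0, 160.0, 165.0, 170.0],
})

# Quartiles per group: 25% and 75% are the box edges, 50% is the median line.
quartiles = toy.groupby("SEX")[["F1", "HEIGHT"]].quantile([0.25, 0.5, 0.75])
print(quartiles)

# The median alone, one value per group.
median_f1 = toy.groupby("SEX")["F1"].median()
print(median_f1)
```

The `boxplot` method of `DataFrame` also accepts a `by=` argument, which may be a shorter route to the grouped plots than building one frame per measurement.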

---
(b) This code plots an F1-F2 scatter plot in which different vowels are displayed in different colours. Run the code and then add comments to the code to describe what is happening in each step.


In [None]:
# convert to "category" type
df["VOWEL"]=df.VOWEL.astype("category")

# encode each category with a unique number
df["VOWELIDX"]=df.VOWEL.cat.codes

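# scatter plot of two formant columns of `data`, coloured by vowel category
# `data` must have a column "VOWELIDX" to distinguish vowel categories
def plot_formants(data, f1="F1", f2="F2", axis_ranges=[3000,500,1100,100]):
  plt.figure(figsize=(10,10))
  # F2 on the x-axis, F1 on the y-axis; one colour per vowel index
  plt.scatter(data[f2], data[f1], c=data.VOWELIDX, cmap="tab10")
  # axis_ranges is [xmin, xmax, ymin, ymax]; the default reverses both axes,
  # as in a conventional vowel chart
  if axis_ranges: plt.axis(axis_ranges)
  plt.xlabel("F2")
  plt.ylabel("F1")
  plt.grid()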

plot_formants(df)
plt.show()

(c) Interpreting the above scatter plot: Will perfect classification be possible using these measurements (F1 and F2 in Hz)? Why?

In [None]:
#(c)
# No. The scatter plot shows a large degree of overlap between different vowels, so the vowels are not cleanly separable in F1-F2 space.

The code below randomly selects a small held-out test set (which is plotted). The remaining samples are defined as the training set.

(d) Is the test set fully representative of the task? Why?

In [None]:
test_set = df.sample(frac=0.05, random_state=0)                                 #frac=5% of the entire data
print(test_set.describe())

train_set = df.drop(test_set.index)
print(train_set.describe())

plot_formants(test_set)
plt.show()

#(d)
# - Not really: we did not ensure that all vowel classes are represented in the test set. 5% of 484 is about 24 samples, and we have 10 classes.
#   All 10 classes might appear among those 24 samples, but nothing guarantees it.
# - Random splits can work for large data sets, but be careful when data points are correlated!
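
One way to guarantee that every class appears in the test set is a stratified split. The sketch below uses `train_test_split` from `sklearn` with its `stratify` argument on synthetic data; the frame `toy`, its class names and its values are invented for illustration and are not part of the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 vowel classes with 20 samples each (invented values).
rng = np.random.default_rng(0)
labels = np.repeat([f"v{i}" for i in range(10)], 20)
toy = pd.DataFrame({
    "F1": rng.normal(500, 100, size=labels.size),
    "F2": rng.normal(1500, 300, size=labels.size),
    "VOWEL": labels,
})

# stratify= keeps the class proportions the same in both parts of the split.
train_toy, test_toy = train_test_split(
    toy, test_size=0.1, stratify=toy["VOWEL"], random_state=0
)

# Count how many test samples each class received.
print(test_toy["VOWEL"].value_counts())
```

Stratification addresses class coverage only. It does not address correlated samples, for example several recordings from the same speaker ending up on both sides of the split.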

(e) Use `sklearn` to train a [Nearest Neighbour Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) on the training set. The inputs should be `F1` and `F2` and the output should be the `VOWEL`. Configure the classifier to use the 3 nearest neighbours.

Hint: You need to define the classifier then call the [`fit` method](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Nearest Neighbour Classifier: a machine learning method that labels a sample according to the closest training samples
#(e)
clf = KNeighborsClassifier(n_neighbors=3)                                       # use the 3 nearest neighbours of each query point
clf.fit(train_set[["F1", "F2"]], train_set["VOWEL"])                            # fit on the training set only, not the test set
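
To see what the classifier does with a query point, the `kneighbors` method returns the nearest training samples, and `predict` takes a majority vote over their labels. A minimal sketch with six invented (F1, F2) points; the values and labels are assumptions for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six invented (F1, F2) points in Hz with made-up vowel labels.
X = np.array([[300, 2300], [320, 2250], [310, 2200],
              [700, 1200], [720, 1250], [690, 1150]])
y = np.array(["i", "i", "i", "a", "a", "a"])

toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[330, 2100]])
# kneighbors returns the distances to, and the row indices of,
# the 3 closest training points.
dist, idx = toy_clf.kneighbors(query)
neighbour_labels = y[idx[0]]

# predict takes the majority label among those neighbours.
prediction = toy_clf.predict(query)[0]
print(neighbour_labels, prediction)
```

Note that the default distance is Euclidean in the raw units, so a feature with a larger numeric range (here F2) tends to dominate the neighbour search.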

(f) Determine the classification accuracy of your classifier on the train and test sets.

Hint: You can use the [`score` method](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score) of the classifier.

In [None]:
#(f)
print("TRAIN SET ACCURACY:", clf.score(train_set[["F1", "F2"]], train_set["VOWEL"]), sep="\t")
print("TEST SET ACCURACY:", clf.score(test_set[["F1", "F2"]], test_set["VOWEL"]), sep="\t")
# Accuracy is higher on the training set; on the test set it is only about 54%. This could be over-fitting, or it could be because the test sample is too small
# and some categories might not be well represented in the training data.
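
For a classifier, `score` reports the fraction of samples whose predicted label matches the true label. The sketch below checks this by hand on invented one-dimensional data (all values are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two invented clusters on a line, labelled 0 and 1.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Four new points; the third is deliberately given a label that
# disagrees with its neighbours, so it will be counted as an error.
X_new = np.array([[1.5], [9.0], [4.0], [8.0]])
y_new = np.array([0, 1, 1, 1])

# Accuracy by hand: proportion of predictions equal to the true labels.
manual_accuracy = np.mean(toy_clf.predict(X_new) == y_new)
print(manual_accuracy, toy_clf.score(X_new, y_new))
```

Training-set accuracy of a nearest neighbour classifier is optimistic by construction, because every training sample counts itself among its own neighbours.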

(g) The following code normalises the data using the [z-score](https://en.wikipedia.org/wiki/Standard_score) for each speaker individually.

Add comments to each code block to explain what is happening.

In [None]:
# standardise each speaker's formants to zero mean and unit standard deviation (z-score)
# for each speaker, calculate mean and standard deviation across all samples of that speaker
# (the formant columns are selected explicitly: recent pandas versions may raise an error when averaging non-numeric columns)
means = df.groupby("SPEAKER")[["F1", "F2"]].agg("mean")                         # group the rows by speaker, then take the mean for each speaker
stds = df.groupby("SPEAKER")[["F1", "F2"]].agg("std")

# look up each row's speaker statistics and convert to numpy arrays (ndarray)  # numpy arrays are also accepted by other libraries such as TensorFlow
F1mean = means.F1[df.SPEAKER].to_numpy()                                        
F1std = stds.F1[df.SPEAKER].to_numpy()
F2mean = means.F2[df.SPEAKER].to_numpy()
F2std = stds.F2[df.SPEAKER].to_numpy()

# normalise F1 and F2 to have zero mean and unit variance
df["F1norm"] = (df.F1 - F1mean) / F1std
df["F2norm"] = (df.F2 - F2mean) / F2std

# some general statistics on each column of df
print(df.describe())

# use plot_formants on normalised axes for f1 and f2
plot_formants(df, f1="F1norm", f2="F2norm", axis_ranges=None)
plt.show()

# statistics about test and training set
test_set = df.sample(frac=0.05, random_state=0)
print(test_set.describe())
train_set = df.drop(test_set.index)
print(train_set.describe())

In [None]:
# Note: the z-score statistics above were computed from all the data, including the test samples; strictly, test data should not contribute to the normalisation.
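
The per-speaker z-score can also be written in one step with `groupby(...).transform`, which returns the group statistic already aligned with the original rows. A minimal sketch on an invented two-speaker table (the values and the name `toy` are assumptions for illustration):

```python
import pandas as pd

# Two invented speakers with three samples each.
toy = pd.DataFrame({
    "SPEAKER": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "F1": [300.0, 400.0, 500.0, 450.0, 600.0, 750.0],
    "F2": [2000.0, 1500.0, 1000.0, 2600.0, 2000.0, 1400.0],
})

grouped = toy.groupby("SPEAKER")[["F1", "F2"]]

# transform("mean") / transform("std") give one value per original row:
# the mean / standard deviation of that row's speaker.
z = (toy[["F1", "F2"]] - grouped.transform("mean")) / grouped.transform("std")
toy["F1norm"] = z["F1"]
toy["F2norm"] = z["F2"]
print(toy)
```

After this step both speakers occupy the same normalised range, even though their raw formant ranges differ.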


(h) Train and evaluate a new classifier, as before, using the normalised formant data. What was the effect on the classification accuracy? Why?

In [None]:
#(h)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train_set[["F1norm", "F2norm"]], train_set["VOWEL"])

print("TRAIN SET ACCURACY:", clf.score(train_set[["F1norm", "F2norm"]], train_set["VOWEL"]), sep="\t")
print("TEST SET ACCURACY:", clf.score(test_set[["F1norm", "F2norm"]], test_set["VOWEL"]), sep="\t")

# - The accuracy improved.
# - Each speaker has slightly different F1 and F2 ranges (depending on their vocal tract) -- after normalisation the data is more separable.

(i) In this exercise we calculated the statistics for normalisation on all the data (train and test sets combined). Is this problematic? What would the consequence be when calculating the generalisation error? When deploying this system, how would we perform this normalisation for a new (unseen) speaker?

In [None]:
#(i)
# - The generalisation error could be underestimated because the normalisation statistics were estimated from data that included the test samples!
# - For an unseen speaker we would need to collect enough data to estimate their statistics first.
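
One possible way to mimic deployment is to hold out whole speakers and to normalise each held-out speaker using only a few of their own "enrolment" samples. The sketch below is an illustration under those assumptions, on synthetic data; the helper `normalise_new_speaker`, the frame `toy` and the choice of 8 enrolment samples are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 10 speakers with 12 samples each (invented values).
rng = np.random.default_rng(0)
speakers = np.repeat([f"s{i}" for i in range(10)], 12)
toy = pd.DataFrame({
    "SPEAKER": speakers,
    "F1": rng.normal(500, 100, size=speakers.size),
    "F2": rng.normal(1500, 300, size=speakers.size),
})

# Hold out whole speakers so that no test speaker contributes to training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["SPEAKER"]))
train_toy, test_toy = toy.iloc[train_idx], toy.iloc[test_idx]

def normalise_new_speaker(enrolment, samples, cols=("F1", "F2")):
    """z-score `samples` with statistics estimated from `enrolment` only."""
    cols = list(cols)
    mean = enrolment[cols].mean()
    std = enrolment[cols].std()
    return (samples[cols] - mean) / std

# For one held-out speaker: the first 8 samples act as enrolment data,
# the remaining samples play the role of new incoming speech.
new_speaker = test_toy[test_toy["SPEAKER"] == test_toy["SPEAKER"].iloc[0]]
enrolment, incoming = new_speaker.iloc[:8], new_speaker.iloc[8:]
incoming_norm = normalise_new_speaker(enrolment, incoming)
print(incoming_norm)
```

In practice the enrolment data would need to cover a reasonable spread of vowels, otherwise the estimated mean and standard deviation are likely to be skewed.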

(j) In this exercise we did not make use of a validation set. Was it necessary? Why?

In [None]:
#(j)
# - The use of a validation set would have been necessary if we wanted to find better hyperparameters for the classifier
#   (e.g. it is possible that using a larger number of neighbours could result in better generalisation error)
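
If the number of neighbours were to be tuned, cross-validation on the training data could play the role of a validation set, leaving the test set untouched. A minimal sketch using `GridSearchCV` on synthetic two-class data; the data and the candidate values for `n_neighbors` are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Two invented, overlapping 2-D classes with 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 5-fold cross-validation: each candidate k is scored on held-out folds
# of the training data, never on the final test set.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 9, 15]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The selected classifier would then be evaluated once on the test set to estimate the generalisation error.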