<a href="https://colab.research.google.com/github/HanSong19/PALS0039-Introduction-to-Deep-Learning-for-Speech-and-Language-Processing-/blob/main/PALS0039_Ex_2_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.3 Classification task

In this exercise we train a model to classify vowels from their [formant frequencies](https://en.wikipedia.org/wiki/Formant).

The following code reads in and summarises a data set of vowel formant frequencies.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/exercise_02/vowels.csv")

print(df)
print("----------------------------------------------------")
print(df.describe())

(a) Create three boxplots that show the differences in F1, F2, and HEIGHT between male and female samples.

Hint: You could use [`plt.subplots`](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html) to position all the plots in a row or column. You could use the [`boxplot` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html) of `DataFrame` to create each plot.

In [None]:
#(a)
# mean F1, F2 and HEIGHT per sex (a list of columns needs double brackets)
sexdata = df.groupby('SEX')[['F1', 'F2', 'HEIGHT']].mean()
sexdata.reset_index(inplace=True)
print(sexdata)

In [None]:
# DataFrame.query selects the subset of rows that match a condition
f1_female = df.query("SEX == 'female'")['F1']
f1_male = df.query("SEX == 'male'")['F1']
f1=pd.concat([f1_female, f1_male], axis = 1)
f1.columns = ['female', 'male']

f2_female= df.query("SEX == 'female'")['F2']
f2_male=df.query("SEX == 'male'")['F2']
f2=pd.concat([f2_female, f2_male], axis =1)
f2.columns=['female', 'male']

height_female = df.query("SEX == 'female'")["HEIGHT"]
height_male = df.query("SEX == 'male'")["HEIGHT"]
height= pd.concat([height_female, height_male], axis=1)
height.columns =['female', 'male']
print(f1)

# one row of three plots, one per measurement
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize = (15,6))

ax1.set_ylabel ("F1 (Hz)")
f1.boxplot(ax=ax1)
ax2.set_ylabel("F2 (Hz)")
f2.boxplot(ax=ax2)
ax3.set_ylabel("HEIGHT (cm)")
height.boxplot(ax=ax3)
plt.show()
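The hint's `boxplot` method can also do the grouping itself through its `by` argument, which avoids building a separate female/male frame for each measurement. The following is a minimal sketch under stated assumptions: the `toy` frame and all of its values are made up so that the block runs without downloading `vowels.csv`; with the real data, `df` would take its place.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for vowels.csv (all values are invented for illustration).
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    "SEX": ["female"] * n + ["male"] * n,
    "F1": np.concatenate([rng.normal(600, 150, n), rng.normal(500, 120, n)]),
    "F2": np.concatenate([rng.normal(1800, 400, n), rng.normal(1500, 350, n)]),
    "HEIGHT": np.concatenate([rng.normal(165, 6, n), rng.normal(178, 7, n)]),
})

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 6))
for ax, column, unit in zip(axes, ["F1", "F2", "HEIGHT"], ["Hz", "Hz", "cm"]):
    # `by="SEX"` draws one box per sex for the chosen column
    toy.boxplot(column=column, by="SEX", ax=ax)
    ax.set_ylabel(f"{column} ({unit})")
fig.suptitle("")  # drop the automatic "Boxplot grouped by SEX" title
plt.show()
```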


---
(b) This code plots an F1-F2 scatter plot in which different vowels are displayed in different colours. Run the code and then add comments to the code to describe what is happening in each step.


In [None]:
# convert to "category" type
df["VOWEL"]=df.VOWEL.astype("category")

# encode each category with a unique number
df["VOWELIDX"]=df.VOWEL.cat.codes

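# plot the formants stored in the data frame `data`
# the data frame must have a column "VOWELIDX" to distinguish vowel categories
def plot_formants(data, f1="F1", f2="F2", axis_ranges=[3000,500,1100,100]):
  # create a square 10x10 inch figure
  plt.figure(figsize=(10,10))
  # F2 on the x-axis, F1 on the y-axis, one colour per vowel index
  plt.scatter(data[f2], data[f1], c=data.VOWELIDX, cmap="tab10")
  # [xmin, xmax, ymin, ymax]: both axes run from high to low, as in a conventional vowel chart
  if axis_ranges: plt.axis(axis_ranges)
  plt.xlabel("F2 [Hz]")
  plt.ylabel("F1 [Hz]")
  plt.grid()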

plot_formants(df)
plt.show()
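To see what the `category` conversion and `cat.codes` do in isolation, here is a small sketch on a hand-made series (the four labels are illustrative, not taken from `vowels.csv`). The categories are the sorted distinct labels, and each code is the position of a row's label in that list.

```python
import pandas as pd

vowels = pd.Series(["i", "a", "u", "a"]).astype("category")

print(list(vowels.cat.categories))  # ['a', 'i', 'u']
print(list(vowels.cat.codes))       # [1, 0, 2, 0]
```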

In [None]:
## plot with the pandas plotting API

# make VOWEL a categorical column
df['VOWEL'] = df['VOWEL'].astype('category')
# the category labels are strings; cat.codes maps each label to an integer,
# which is what numeric arguments such as the colour of plt.scatter expect
#df["VOWELIDX"]=df.VOWEL.cat.codes
#df

df.plot(x='F2', y='F1', c='VOWEL', kind='scatter', cmap='YlGnBu_r')

In [None]:
#plot with matplotlib
plt.scatter(x=df['F2'], y=df['F1'], c= df.VOWELIDX, cmap='tab10')
plt.show()

(c) Interpreting the above scatter plot: Will perfect classification be possible using these measurements (F1 and F2 in Hz)? Why?

In [None]:
#(c)
# No. The scatter plot shows a large degree of overlap between different vowels, so the vowels are not cleanly separable in F1-F2 space.

The code below randomly selects a small held-out test set (which is plotted). The remaining samples are defined as the training set.

(d) Is the test set fully representative of the task? Why?

In [None]:
# randomly sample 5% of the data as the test set (random_state makes the split repeatable)
test_set = df.sample(frac=0.05, random_state=0)
print(test_set.describe())

# every row has an index label (row 1, row 2, row 3, ...); drop the rows whose labels are in the test set
train_set= df.drop(test_set.index)
print(train_set.describe())

test_set.plot(x='F2', y='F1', c='VOWELIDX', kind='scatter',cmap='Accent')
plt.show()



#(d)
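# - Not really: the random split does not ensure that all vowel classes are represented in the test set.
#   5% of 484 is only about 24 samples spread over 10 classes, so some classes may have very few or no test samples.
#   Random splits can work for large data sets, but be careful when data points are correlated (e.g. several samples from one speaker)!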
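One way to make sure every class appears in a small test set is a stratified split. The sketch below uses `train_test_split` from `sklearn.model_selection` with its `stratify` argument. This is an alternative to the `df.sample` approach used in the exercise, not the method the exercise prescribes, and the `toy` frame (10 made-up classes with 48 random samples each) is an assumption standing in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 vowel classes, 48 samples each (values are invented).
rng = np.random.default_rng(0)
labels = [f"v{i}" for i in range(10)]
toy = pd.DataFrame({
    "F1": rng.normal(500, 100, 480),
    "F2": rng.normal(1500, 300, 480),
    "VOWEL": np.repeat(labels, 48),
})

# stratify keeps the class proportions (approximately) equal in both parts
train_part, test_part = train_test_split(
    toy, test_size=0.05, stratify=toy["VOWEL"], random_state=0
)
print(test_part["VOWEL"].value_counts())
```

Stratifying by `VOWEL` does not address correlation between samples of the same speaker; that would call for a split by speaker instead.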

(e) Use `sklearn` to train a [Nearest Neighbour Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) on the training set. The inputs should be `F1` and `F2` and the output should be the `VOWEL`. Configure the classifier to use the 3 nearest neighbours.

Hint: You need to define the classifier then call the [`fit` method](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_set[['F1','F2']], train_set['VOWEL'])


(f) Determine the classification accuracy of your classifier on the train and test sets.

Hint: You can use the [`score` method](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score) of the classifier.

In [None]:
#(f)
print("test set accuracy:", neigh.score(test_set[['F1','F2']], test_set['VOWEL']), sep="\t")
print("train set accuracy:", neigh.score(train_set[['F1','F2']], train_set['VOWEL']),sep="\t")

## the training accuracy is higher than the test accuracy (about 75% vs. 54%)
## part of this gap is expected for k-NN: every training point is one of its own nearest neighbours, so the training score is optimistic
## the data set is also small, and not all vowels may be well represented in the training or test data
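The optimism of the training score can be illustrated on synthetic data. In this sketch (two overlapping Gaussian classes, all values invented), a 1-nearest-neighbour classifier scores perfectly on its own training points because each point is its own nearest neighbour, while the held-out score is lower. Larger `k` reduces, but does not remove, this effect.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two overlapping 2-D classes (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# shuffle, then hold out the last 50 points
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

train_acc = {}
for k in (1, 3, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc[k] = clf.score(X_train, y_train)
    print(k, train_acc[k], clf.score(X_test, y_test), sep="\t")
```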

(g) The following code normalises the data using the [z-score](https://en.wikipedia.org/wiki/Standard_score) for each speaker individually.

Add comments to each code block to explain what is happening.

In [None]:
# for each speaker, calculate mean and standard deviation across all samples of that speaker
# only F1 and F2 are needed below; selecting them also avoids aggregating non-numeric columns
mean = df.groupby('SPEAKER')[['F1', 'F2']].mean() # one row of means per speaker
std = df.groupby('SPEAKER')[['F1', 'F2']].std()   # one row of standard deviations per speaker
print(mean)
print(std)



In [None]:
# look up each row's speaker statistics and convert them to numpy arrays (ndarray)
# numpy arrays can also be passed on to libraries such as TensorFlow
# indexing with df['SPEAKER'] repeats each speaker's value once per sample, so the arrays have the same length as df
F1_mean=mean['F1'][df['SPEAKER']].to_numpy()
F1_std=std['F1'][df['SPEAKER']].to_numpy()
F2_mean=mean['F2'][df['SPEAKER']].to_numpy()
F2_std=std['F2'][df['SPEAKER']].to_numpy()
print("F1_mean:", F1_mean)
print("F2_std:", F2_std)


# normalise F1 and F2 to have zero mean and unit variance
df["F1norm"] = (df.F1 - F1_mean) / F1_std
df["F2norm"] = (df.F2 - F2_mean) / F2_std

# some general statistics on each column of df
print(df.describe())
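The same per-speaker z-score can be written with `groupby(...).transform`, which returns the group statistic already aligned to the original rows, so no manual lookup by speaker is needed. This is a sketch of an equivalent formulation on a tiny hand-made frame (the speakers `s1`, `s2` and their values are invented), not the code used in the exercise.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SPEAKER": ["s1"] * 3 + ["s2"] * 3,
    "F1": [300.0, 500.0, 700.0, 400.0, 600.0, 800.0],
    "F2": [1000.0, 1500.0, 2000.0, 1200.0, 1800.0, 2400.0],
})

per_speaker = toy.groupby("SPEAKER")[["F1", "F2"]]
# transform() broadcasts each speaker's mean/std back onto that speaker's rows
normalised = (toy[["F1", "F2"]] - per_speaker.transform("mean")) / per_speaker.transform("std")
toy["F1norm"] = normalised["F1"]
toy["F2norm"] = normalised["F2"]
print(toy)
```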


In [None]:
# scatter plot of the normalised F1 and F2, coloured by vowel
df.plot(x="F2norm", y="F1norm", c="VOWEL", kind='scatter')
plt.show()


(h) Train and evaluate a new classifier, as before, using the normalised formant data. What was the effect on the classification accuracy? Why?

In [None]:
#(h)
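# re-create the split so that it includes the new F1norm and F2norm columns;
# random_state=0 selects the same test rows as before, so the scores are comparable
test_set= df.sample(frac=.05, random_state=0)
print(test_set.describe())
train_set=df.drop(test_set.index)
print(train_set.describe())

# here, instead of train_set[['F1','F2']], use train_set[['F1norm','F2norm']]
# because the classifier is trained and tested on the normalised formants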

neig=KNeighborsClassifier(n_neighbors=3)
neig.fit(train_set[['F1norm','F2norm']], train_set['VOWEL'])
print("train data score:", neig.score(train_set[['F1norm','F2norm']], train_set['VOWEL']))
print("test data score:", neig.score(test_set[['F1norm','F2norm']], test_set['VOWEL']))

# - The accuracy improved.
#   Speakers differ in their overall F1 and F2 ranges, and this between-speaker variation blurs the vowel classes in raw Hz.
#   After per-speaker normalisation the classes are more separable.

(i) In this exercise we calculated the statistics for normalisation on all the data (train and test sets combined). Is this problematic? What would the consequence be when calculating the generalisation error? When deploying this system, how would we perform this normalisation for a new (unseen) speaker?

In [None]:
#(i)
# - The generalisation error could be underestimated because the estimated statistics included the test samples!
#   One option would be to estimate each speaker's statistics from the training samples only and apply them to the test samples.
# - For an unseen speaker we would need to collect enough data to estimate their statistics first.
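A minimal sketch of normalising with training-only statistics, assuming hypothetical `train` and `test` frames with invented values. Each test sample is normalised with the mean and standard deviation estimated from the training samples of the same speaker. A speaker who never appears in the training data (`s3` here) has no statistics and comes out as `NaN`, which is why some data from a new speaker has to be collected first.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with invented values.
train = pd.DataFrame({
    "SPEAKER": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "F1": [300.0, 500.0, 700.0, 400.0, 600.0, 800.0],
})
test = pd.DataFrame({
    "SPEAKER": ["s1", "s2", "s3"],  # s3 never appears in the training data
    "F1": [600.0, 400.0, 550.0],
})

# statistics are estimated from the training samples only
speaker_mean = train.groupby("SPEAKER")["F1"].mean()
speaker_std = train.groupby("SPEAKER")["F1"].std()

# map() looks up each test row's speaker; unknown speakers yield NaN
test["F1norm"] = (test["F1"] - test["SPEAKER"].map(speaker_mean)) / test["SPEAKER"].map(speaker_std)
print(test)
```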

(j) In this exercise we did not make use of a validation set. Was it necessary? Why?

In [None]:
#(j)
# - The use of a validation set would have been necessary if we wanted to find better hyperparameters for the classifier
#   (e.g. it is possible that using a larger number of neighbours could result in better generalisation error)
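If the number of neighbours were to be tuned, one common approach is cross-validation on the training data, which plays the role of a validation set while leaving the test set untouched. The sketch below uses `GridSearchCV` from scikit-learn on synthetic three-class data. The candidate values for `n_neighbors`, the 5 folds and the data itself are assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data: three overlapping 2-D classes (values are invented).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
X_train = np.vstack([rng.normal(c, 1.0, size=(60, 2)) for c in centres])
y_train = np.repeat([0, 1, 2], 60)

# 5-fold cross-validation over a few candidate values of k
candidates = [1, 3, 5, 9, 15]
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidates}, cv=5)
search.fit(X_train, y_train)

print("best k:", search.best_params_["n_neighbors"])
print("mean cross-validated accuracy:", search.best_score_)
```

The accuracy reported by the search is a validation estimate. The final generalisation error would still be measured once on the held-out test set.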