## Speaker Recognition Pre-processing
- Loads files
- MFCC
    - The cepstral representation of the speech spectrum captures the local spectral properties of the signal within each analysis frame.
    - The coefficients are usually written as $c_n = \sum_{m=1}^{M} (\log S_m) \cos\left[ n \left( m - \tfrac{1}{2} \right) \tfrac{\pi}{M} \right]$, where $n$ is the index of the cepstral coefficient and $S_m$ is the output of the $m$-th channel of an $M$-channel filterbank. **The number of mel cepstrum coefficients kept per frame is typically 10 to 15** (13 in this notebook). The set of coefficients calculated for each frame is called a feature vector, or acoustic vector, so each input utterance is transformed into a sequence of acoustic vectors. The Vector Quantization section describes how these acoustic vectors can be used to represent and recognize the voice characteristics of a speaker.
- Generates .csv files
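
The cepstral step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the usual cosine-transform form, $c_n = \sum_{m=1}^{M} (\log S_m) \cos[n(m - 1/2)\pi/M]$; the function name `cepstrum_from_filterbank` and the flat test spectrum are made up for the example, and the values returned by `python_speech_features.mfcc` will differ because the library applies its own scaling and post-processing.

```python
import numpy as np

def cepstrum_from_filterbank(filterbank_energies, n_coeffs=13):
    """Cosine transform of log filterbank energies (hypothetical helper).

    c_n = sum_{m=1..M} log(S_m) * cos(n * (m - 0.5) * pi / M), n = 0..n_coeffs-1
    """
    S = np.asarray(filterbank_energies, dtype=float)
    M = S.shape[-1]                      # number of filterbank channels
    m = np.arange(1, M + 1)              # channel index 1..M
    n = np.arange(n_coeffs)[:, None]     # coefficient index as a column
    basis = np.cos(n * (m - 0.5) * np.pi / M)   # shape (n_coeffs, M)
    return np.log(S) @ basis.T

# A flat spectrum has no spectral shape, so only c_0 is non-zero.
flat_energies = np.full(26, 2.0)         # 26 channels is an assumed value
flat_cepstrum = cepstrum_from_filterbank(flat_energies)
print(flat_cepstrum.round(6))
```

Keeping only the first 10 to 15 coefficients discards fine spectral detail and leaves a compact description of the spectral envelope for each frame.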

In [1]:
import os
import numpy as np
import pandas as pd
import scipy.io.wavfile as wav
from python_speech_features import mfcc

In [2]:
root_dir = 'speaker_recognition/'
iden_split = root_dir + 'iden_split.txt'
utter_dir = root_dir + 'wav/'

In [3]:
# The with-block closes the file automatically, so no explicit close() is needed.
with open(iden_split, 'r') as f:
    idens = f.read()

In [4]:
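# Drop the first two characters of each line (presumably a label and a space before the file path).
idens = [line[2:] for line in idens.split('\n')]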

In [5]:
len(idens)

153517

In [7]:
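data = []          # one MFCC vector per frame
s_ids = []         # speaker id for each frame
seen_ids = set()
utter_ids = []     # utterance index for each frame

for i, iden in enumerate(idens):
    # The first 7 characters of the path are used as the speaker id.
    s_id = iden[:7]

    # Stop as soon as a 51st distinct speaker id is seen.
    seen_ids.add(s_id)
    if len(seen_ids) > 50:
        break

    rate, sig = wav.read(utter_dir + iden)
    # 50 ms Hamming windows with a 20 ms step; mfcc returns one row per frame.
    mfcc_feat = mfcc(sig, rate, winfunc=np.hamming, winlen=0.05, winstep=0.02, nfft=1024)
    data.extend(mfcc_feat)
    # Repeat the labels so that every frame row carries its speaker and utterance.
    s_ids.extend([[s_id]] * len(mfcc_feat))
    utter_ids.extend([[i]] * len(mfcc_feat))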
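
To see why each utterance contributes many rows, the windowing controlled by `winlen` and `winstep` can be sketched with NumPy alone. This is only an approximation of what happens inside `mfcc`: the helper `frame_signal`, the 16 kHz rate, and the synthetic sine wave are assumptions for illustration, and the library's exact rounding and padding rules may differ.

```python
import math
import numpy as np

def frame_signal(sig, rate, winlen=0.05, winstep=0.02, winfunc=np.hamming):
    """Cut a 1-D signal into overlapping, windowed frames (hypothetical helper)."""
    frame_len = int(round(winlen * rate))     # samples per window
    frame_step = int(round(winstep * rate))   # samples between window starts
    if len(sig) <= frame_len:
        n_frames = 1
    else:
        n_frames = 1 + math.ceil((len(sig) - frame_len) / frame_step)
    # Zero-pad the tail so the last frame has full length.
    padded_len = (n_frames - 1) * frame_step + frame_len
    padded = np.concatenate([sig, np.zeros(padded_len - len(sig))])
    starts = np.arange(n_frames) * frame_step
    idx = starts[:, None] + np.arange(frame_len)[None, :]
    return padded[idx] * winfunc(frame_len)

rate = 16000                                   # assumed sampling rate
sig = np.sin(2 * np.pi * 440 * np.arange(rate) / rate)   # 1 s synthetic tone
frames = frame_signal(sig, rate)

# One label per frame, mirroring the s_ids / utter_ids bookkeeping.
frame_labels = np.repeat('speaker_x', len(frames))
print(frames.shape, frame_labels.shape)
```

With these settings consecutive windows overlap by 30 ms, so even a short utterance yields dozens of feature vectors, each of which needs its own speaker and utterance label.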

In [8]:
data = np.array(data)
s_ids = np.array(s_ids)
utter_ids = np.array(utter_ids)

In [9]:
data_df = pd.DataFrame(data=data, columns=['MFCC_'+str(i) for i in range(13)])
ids_df = pd.DataFrame(data=s_ids, columns=['s_id'])
utter_ids_df = pd.DataFrame(data=utter_ids, columns=['utter_id'])

In [10]:
data_concact = pd.concat([utter_ids_df, ids_df, data_df], axis=1)

In [11]:
data_concact.dtypes

utter_id      int64
s_id         object
MFCC_0      float64
MFCC_1      float64
MFCC_2      float64
MFCC_3      float64
MFCC_4      float64
MFCC_5      float64
MFCC_6      float64
MFCC_7      float64
MFCC_8      float64
MFCC_9      float64
MFCC_10     float64
MFCC_11     float64
MFCC_12     float64
dtype: object

In [12]:
data_concact.to_csv(root_dir + 'csv/utter_MFCC.csv')
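
A short sketch of reading such a file back and regrouping the frames by speaker, which is the shape the later clustering steps need. It uses a tiny synthetic table and an in-memory buffer instead of the real CSV; `demo_df`, `loaded`, `frames_by_speaker`, and the speaker ids are placeholder names, not part of the notebook.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real table: 2 speakers, 5 frames, 13 MFCCs.
rng = np.random.default_rng(0)
mfcc_cols = ['MFCC_' + str(i) for i in range(13)]
demo_df = pd.DataFrame(rng.normal(size=(5, 13)), columns=mfcc_cols)
demo_df.insert(0, 's_id', ['id_a', 'id_a', 'id_a', 'id_b', 'id_b'])
demo_df.insert(0, 'utter_id', [0, 0, 1, 2, 2])

# to_csv() without index=False also writes the row index as an unnamed first column.
buffer = io.StringIO()
demo_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns that first column back into the index on load.
loaded = pd.read_csv(buffer, index_col=0)

# One (n_frames, 13) array per speaker.
frames_by_speaker = {
    s_id: group[mfcc_cols].to_numpy()
    for s_id, group in loaded.groupby('s_id')
}
print({s_id: arr.shape for s_id, arr in frames_by_speaker.items()})
```

Passing `index=False` to `to_csv` is an alternative if the row index is not needed in the file.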

### Visualization of the data

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
from sklearn.manifold import TSNE
# Project the 13-dimensional MFCC frames to 2-D for plotting.
X_embedded = TSNE(n_components=2).fit_transform(data)

In [None]:
fig = plt.figure(figsize=(10, 10))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], color='grey', s=15, alpha=0.75, marker='^')

## Vector Quantization
- Key terms and how they are used
    - **Codebook**: each category (speaker) has its own codebook, which is a set of codewords;
    - **Codeword**: for each speaker, its feature vectors are clustered (using different methods), and a codeword is the centroid of a cluster;
    - **Predict an unknown speaker**: compare its feature vectors with each speaker's codebook and pick the codebook with the lowest distortion (unclearly written in the source)
        - mean distortion measure (average over all codewords) <- potentially better
        - min distortion measure (distance to the closest codeword)
        - Euclidean distance, absolute or squared
- Experiments
    - different clustering methods (which determine how the codewords are distributed)
        - K-means
        - Bisecting k-means: hierarchical in nature
        - Gaussian Mixture Model (GMM)
    - size of codebook [0:200]
        - measured by accuracy and precision/recall/F1 (prf)
        - measured by distortion, etc.
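
The codebook idea and the two distortion measures listed above can be sketched as follows. This is a minimal NumPy illustration under several assumptions: plain k-means (Lloyd's algorithm) stands in for the clustering step, squared Euclidean distance is used, the `mean` and `min` measures follow one reading of the notes, and the two synthetic "speakers" are random Gaussian clouds rather than real MFCC frames. All function and variable names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(features, codebook):
    """Squared Euclidean distance from every feature vector to every codeword."""
    return ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)

def train_codebook(features, n_codewords, n_iter=20, seed=0):
    """Plain k-means: returns an (n_codewords, n_dims) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the codewords with randomly chosen feature vectors.
    start = rng.choice(len(features), n_codewords, replace=False)
    codebook = features[start].copy()
    for _ in range(n_iter):
        nearest = pairwise_sq_dists(features, codebook).argmin(axis=1)
        for k in range(n_codewords):
            members = features[nearest == k]
            if len(members) > 0:          # keep the old codeword if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def distortion(features, codebook, measure='min'):
    """Average per-frame distortion of `features` against one codebook."""
    dists = pairwise_sq_dists(features, codebook)
    if measure == 'min':                  # distance to the closest codeword
        per_frame = dists.min(axis=1)
    else:                                 # 'mean': average over all codewords
        per_frame = dists.mean(axis=1)
    return per_frame.mean()

def predict_speaker(features, codebooks, measure='min'):
    """Return the speaker whose codebook gives the lowest distortion."""
    scores = {s: distortion(features, cb, measure) for s, cb in codebooks.items()}
    return min(scores, key=scores.get)

# Synthetic stand-ins for two speakers' 13-dimensional feature vectors.
rng = np.random.default_rng(0)
train = {
    'spk_a': rng.normal(loc=0.0, scale=1.0, size=(200, 13)),
    'spk_b': rng.normal(loc=5.0, scale=1.0, size=(200, 13)),
}
codebooks = {s: train_codebook(x, n_codewords=8) for s, x in train.items()}

unknown = rng.normal(loc=5.0, scale=1.0, size=(50, 13))   # drawn like spk_b
pred_min = predict_speaker(unknown, codebooks, measure='min')
pred_mean = predict_speaker(unknown, codebooks, measure='mean')

# Codebook-size sweep: training distortion for a few sizes.
sweep = {k: distortion(train['spk_a'], train_codebook(train['spk_a'], k))
         for k in (1, 4, 16)}
print(pred_min, pred_mean, sweep)
```

The same loop over codebook sizes could record accuracy or precision/recall/F1 on held-out utterances instead of training distortion, and `train_codebook` could be swapped for bisecting k-means or a GMM to compare clustering methods.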