# <font color='red'> **DIGIT RECOGNITION FROM SPEECH UTTERANCES** </font>
#### BY AVI KHANDELWAL

### Using the training and test digit utterances stored in the “training” and “testing” folders on the drive, we perform digit recognition with a K-nearest neighbor (KNN) classifier.
The data can be found at this link:
https://drive.google.com/drive/folders/1o3cygZjlxfRTkWmV85HbnRYXRCAJU9AB?usp=sharing

#### **1. Importing Libraries-**



In [10]:
import numpy as np
from math import sin,cos,pi
import matplotlib.pyplot as plt
import warnings
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from scipy import stats
from sklearn.metrics import confusion_matrix
#!pip install python_speech_features
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
from python_speech_features import ssc
import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)
import os, sys

#drive.mount('/content/drive')

warnings.filterwarnings('ignore')

#### **2. Function Definition-**

In [11]:
# The distance between two vectors can be measured with the 2-norm, but the distance between two matrices needs a different approach.
# Here, the distance between every row of matrix 1 and every row of matrix 2 is computed and stored as a scalar entry in a result matrix.
# For instance, if mat1 has 3 rows and mat2 has 4 rows (any number of columns, as long as it is the same in both), we get 12 distance values stored in a 3x4 matrix, one for every pair of rows from mat1 and mat2.
# Once the result matrix is calculated, the average of all its values gives the final distance between the two matrices.

def dist(arr1,arr2):
  rows1 = len(arr1)
  rows2 = len(arr2)
  mat = np.zeros([rows1,rows2])
  for i in range(rows1):
    for j in range(rows2):
      mat[i][j] = np.linalg.norm(arr1[i]-arr2[j])
  return np.sum(mat)/(rows1*rows2)
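
The same mean pairwise distance can also be written with NumPy broadcasting instead of two Python loops. The following is a minimal sketch rather than part of the original notebook; the name `dist_vectorized` and the toy matrices are hypothetical, and it assumes both inputs are 2-D arrays with the same number of columns.

```python
import numpy as np

def dist_vectorized(arr1, arr2):
    """Hypothetical helper: mean Euclidean distance over all row pairs."""
    a = np.asarray(arr1, dtype=float)
    b = np.asarray(arr2, dtype=float)
    # diff[i, j, :] = a[i] - b[j], so diff has shape (rows1, rows2, cols)
    diff = a[:, None, :] - b[None, :, :]
    # Euclidean norm along the feature axis gives the rows1 x rows2 distance matrix
    pairwise = np.linalg.norm(diff, axis=2)
    return pairwise.mean()

# Toy check: the row distances are 0 and 5, so the mean is 2.5
m1 = np.array([[0.0, 0.0], [3.0, 4.0]])
m2 = np.array([[0.0, 0.0]])
toy_distance = dist_vectorized(m1, m2)
print(toy_distance)
```

Because the result is an average over every pair of rows, it does not depend on the order of the rows in either matrix.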

#### **3. Loading the data-**

In [22]:
# Parameter values

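km = 13 # Value of parameter k in the k-means clustering model (number of cluster centers per file)
kn = 13 # Value of parameter k in the KNN classifier (number of nearest neighbors that vote)
tr = 300 # Number of training files (.wav)
ts = 120 # Number of testing files (.wav)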

X_train = np.zeros([tr,km,13])
X_test = np.zeros([ts,km,13])

tr_labels = np.zeros([tr],dtype=int)
ts_act_labels = np.zeros([ts],dtype=int)
ts_pred_labels = np.zeros([ts],dtype=int)
dist_matrix = np.zeros([ts,tr])


path = '/content/drive/My Drive/speech_utterances/training/' 
dirs = os.listdir(path)
i = 0
for fil in dirs:
  (rate,sig) = wav.read(path+fil) # Read the .wav training file from the path above

  tr_labels[i] = int(fil[10]) # The training label is the 11th character (index 10) of the training file name

  mfcc_feat = pd.DataFrame(mfcc(sig,rate)) # Extract the MFCC features from the training speech file
  # mfcc_feat is a feature matrix, typically of 88 rows x 13 columns, which can be interpreted as
  # 88 feature vectors, each of dimension 13. The number of rows (feature vectors) may change from one .wav file to another, but there are
  # always 13 columns, the dimension of every MFCC feature vector.

  clustering = KMeans(n_clusters=km) # Build a k-means clustering model with k set to km

  clustering.fit(mfcc_feat) # Fit the clustering model on all the 13-dimensional MFCC feature vectors of this file (88 of them, or more/fewer for some .wav files)

  X_train[i] = clustering.cluster_centers_ # Store the cluster centers as the representation of the .wav training file. In other words, the whole .wav file,
  # which was earlier described by a feature matrix of 88 rows x 13 columns, is now represented by km data points in a 13-dimensional vector space.

  i += 1

# Repeat the same process for the .wav test files
path = '/content/drive/My Drive/speech_utterances/test/' 
dirs = os.listdir(path)
i = 0
for fil in dirs:
  ts_act_labels[i] = int(fil[10])
  (rate,sig) = wav.read(path+fil)
  mfcc_feat = pd.DataFrame(mfcc(sig,rate))
  clustering = KMeans(n_clusters=km)
  clustering.fit(mfcc_feat)
  X_test[i] = clustering.cluster_centers_
  i += 1
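
To make the shape bookkeeping concrete, here is a minimal sketch of the summarisation step on synthetic data. The random matrix `fake_mfcc` is a hypothetical stand-in for a real MFCC matrix (no audio is read), and `n_init=10` and `random_state=0` are assumptions added only to make the run repeatable; the notebook itself uses the default `KMeans` settings.

```python
import numpy as np
from sklearn.cluster import KMeans

km = 13  # same number of clusters as in the notebook

# Hypothetical stand-in for one utterance: 88 frames x 13 MFCC coefficients
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(88, 13))

# Fixed seed so repeated runs give the same cluster centers (an assumption,
# not a setting used in the notebook)
clustering = KMeans(n_clusters=km, n_init=10, random_state=0)
clustering.fit(fake_mfcc)

# Whatever the number of frames, the summary always has km rows and 13 columns
centers = clustering.cluster_centers_
print(centers.shape)
```

Two practical points follow from this. `KMeans` needs at least `km` samples, so an utterance with fewer than `km` frames would raise an error. The order in which the cluster centers are returned is arbitrary, which is harmless here because `dist` averages over every pair of rows.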

#### **4. Predicting the labels for test files-**

In [23]:
for i in range(ts):
  for j in range(tr):
    dist_matrix[i][j] = dist(X_test[i],X_train[j]) # calculating the distance between the ith test speech file and jth training speech file

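for i in range(ts):
  k_indices = np.argpartition(dist_matrix[i],kn) # Partition the indices so that the kn smallest distances come first
  k_min_indices = k_indices[:kn] # Indices of the kn nearest neighbors of the ith test .wav file
  mode_labels = tr_labels[k_min_indices] # Labels of those nearest neighbors
  mode,count = stats.mode(mode_labels) # Majority vote among the neighbor labels
  ts_pred_labels[i] = int(np.ravel(mode)[0]) # np.ravel handles SciPy versions that return either a scalar or a 1-element array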
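
The neighbor selection and majority vote can be traced by hand on a small example. This is an illustrative sketch with made-up distances and labels (all names prefixed `toy_` are hypothetical), using `np.bincount` for the vote instead of `stats.mode`.

```python
import numpy as np

# Hypothetical data: distances from one test file to six training files,
# and the digit label of each of those training files
toy_distances = np.array([0.9, 0.2, 0.5, 0.1, 0.8, 0.3])
toy_labels = np.array([7, 3, 3, 5, 7, 3])
k = 3

# argpartition places the indices of the k smallest distances in the first
# k positions (in no particular order), without fully sorting the array
nearest = np.argpartition(toy_distances, k)[:k]

# Labels of the k nearest training files
votes = toy_labels[nearest]

# Majority vote: count each label and take the most frequent one
# (argmax returns the smallest label when there is a tie)
predicted = np.bincount(votes).argmax()
print(sorted(nearest.tolist()), predicted)
```

The three smallest distances are 0.1, 0.2 and 0.3, at indices 3, 1 and 5, whose labels are 5, 3 and 3, so the predicted digit is 3.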


#### **5. Result-**

In [24]:
arr = confusion_matrix(ts_act_labels, ts_pred_labels)
print("Confusion Matrix:\n\n",arr)
print("\n")
digit_ac = np.zeros(10) 
for i in range(10):
  digit_ac[i] = np.round(arr[i][i]/sum(arr[i])*100,2)
  print("Accuracy for digit class",i,":",digit_ac[i],"%")

ov_pc = round(sum(digit_ac)/len(digit_ac),2)
print("\n")
print("Overall Accuracy: ",ov_pc,"%")

Confusion Matrix:

 [[ 1  1 10  0  0  0  0  0  0  0]
 [ 0 12  0  0  0  0  0  0  0  0]
 [ 0  0 12  0  0  0  0  0  0  0]
 [ 0  0  1  3  0  0  0  0  8  0]
 [ 0  0  3  0  9  0  0  0  0  0]
 [ 1  0  0  0  0  6  1  0  4  0]
 [ 0  0  0  0  0  0 12  0  0  0]
 [ 1  1  3  0  1  0  1  3  2  0]
 [ 0  0  0  0  0  0  0  0 12  0]
 [ 0  2  2  0  0  0  0  0  1  7]]


Accuracy for digit class 0 : 8.33 %
Accuracy for digit class 1 : 100.0 %
Accuracy for digit class 2 : 100.0 %
Accuracy for digit class 3 : 25.0 %
Accuracy for digit class 4 : 75.0 %
Accuracy for digit class 5 : 50.0 %
Accuracy for digit class 6 : 100.0 %
Accuracy for digit class 7 : 25.0 %
Accuracy for digit class 8 : 100.0 %
Accuracy for digit class 9 : 58.33 %


Overall Accuracy:  64.17 %
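
The per-class figures above can be cross-checked directly from the printed confusion matrix. The sketch below copies that matrix by hand (the variable names are hypothetical) and computes the per-class accuracy as the diagonal divided by the row sums, and the overall accuracy as the number of correct predictions over all test files.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([
    [ 1,  1, 10,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 12,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0, 12,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  1,  3,  0,  0,  0,  0,  8,  0],
    [ 0,  0,  3,  0,  9,  0,  0,  0,  0,  0],
    [ 1,  0,  0,  0,  0,  6,  1,  0,  4,  0],
    [ 0,  0,  0,  0,  0,  0, 12,  0,  0,  0],
    [ 1,  1,  3,  0,  1,  0,  1,  3,  2,  0],
    [ 0,  0,  0,  0,  0,  0,  0,  0, 12,  0],
    [ 0,  2,  2,  0,  0,  0,  0,  0,  1,  7],
])

# Per-class accuracy: correct predictions for a digit / test files of that digit
per_class = np.diag(cm) / cm.sum(axis=1) * 100

# Overall accuracy: all correct predictions / all test files
overall = np.trace(cm) / cm.sum() * 100

print(np.round(per_class, 2))
print(round(overall, 2))  # 64.17
```

Averaging the per-class accuracies, as the notebook does, matches this overall figure only because every digit has the same number of test files (12); with unbalanced classes the two numbers would differ.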


#### **6. Conclusion-**

1. Digit recognition from speech using a simple machine learning tool such as K-nearest neighbors is a very basic approach. As seen above, prediction accuracy is perfect for some digit classes (1, 2, 6 and 8) but unacceptably low for others (8.33% for digit 0, and 25% for digits 3 and 7).

2. Moving forward, a better technique would be to use a more sophisticated statistical model such as the Hidden Markov Model (HMM), which is designed for problems that involve sequential information, like speech.

3. To get even better results, a deep learning approach using CNNs could significantly improve performance, provided much more data is available than the 300 .wav training files used in this project.