# TEAM NAME: GEEK GURUS

## Problem Statement: 
You are tasked with developing a voice-based profile unlock system - a smart voice
lock that can accurately identify and open a person's profile based on their voice. The
system will be used to grant secure access to individual user profiles in various
applications, devices, or online platforms. Participants are expected to build a robust
and reliable software solution that can distinguish between different users by
analysing their unique voice characteristics.

### REQUIREMENTS:
● Voice Data Collection :- 18 pre-collected datasets and 7 individual datasets.\
● Voice Feature Extraction :- Short-Time Fourier Transform (STFT), followed by converting amplitude to the dB scale and finally plotting the spectrogram of each sample.\
● Machine Learning Model :- Trained Convolutional Neural Network.\
● Simple user profile management
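
The feature-extraction step listed above (STFT, then amplitude to dB) can be sketched with NumPy alone. This is a minimal illustration, not the code used to produce our spectrogram images: the function name `stft_db`, the frame length, the hop size and the synthetic test tone are all assumptions made for the example.

```python
import numpy as np

def stft_db(signal, frame_len=512, hop=128, eps=1e-10):
    """Return a (freq_bins, frames) magnitude spectrogram in dB (hypothetical helper)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # FFT of each frame; transpose so rows are frequency bins
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    # Amplitude to dB, relative to the loudest bin (so the maximum is 0 dB)
    ref = max(magnitude.max(), eps)
    return 20.0 * np.log10(np.maximum(magnitude, eps) / ref)

# Synthetic 1 kHz tone sampled at 16 kHz, standing in for a voice recording
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_db(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512
print(spec.shape, peak_hz)
```

The resulting 2-D array is what gets rendered as a spectrogram image (for example with `matplotlib.pyplot.imshow`) before it is fed to the CNN.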

## Results:
### ● Trained a CNN model for voice recognition that reaches 92.58% accuracy.


#### Importing necessary modules like numpy, pandas, matplotlib, tensorflow, PIL, sklearn and pickle.

In [2]:
import os
import pickle
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### Loading the dataset.

In [3]:
dataset_path="./train/images/"
folders=os.listdir(dataset_path)

# Initializing training and test dataset
X_train=[]
y_train=[]
X_test=[]
y_test=[]

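# Split the dataset into training and test sets: each image goes to the test set with probability 0.2.
# The mask is sized from the actual image count (the original run hard-coded 3960).
n_images=sum(len(os.listdir(dataset_path+dirs)) for dirs in folders)
num=np.random.rand(n_images)
mask=num<0.2
split=mask.astype(int)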

i=0
for dirs in folders:
    for img in os.listdir(str(dataset_path+dirs)):
        image=Image.open(str(dataset_path+dirs+'/'+img))
        new_img=image.resize((200,200))
        tmp_array=np.array(new_img)/255.
        if split[i]==0:
            X_train.append(tmp_array)
            y_train.append(str(dirs))
        else:
            X_test.append(tmp_array)
            y_test.append(str(dirs))
        
        i=i+1
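
The split above draws fresh random numbers on every run, so the train and test sets change each time the notebook is executed. A seeded variant makes the split repeatable. The sketch below is one way to do it; the helper name `make_split_mask` and the seed value are assumptions, not part of our original notebook.

```python
import numpy as np

def make_split_mask(n_samples, test_fraction=0.2, seed=0):
    """Boolean mask: True marks a test sample, False a training sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.random(n_samples) < test_fraction

# 3960 is the sample count used in the cell above
mask = make_split_mask(3960)
print(mask.sum(), "test samples,", (~mask).sum(), "training samples")
```

Because the generator is seeded, calling `make_split_mask(3960)` again returns exactly the same mask, which makes accuracy figures comparable between runs.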

#### Encoding dependent variables using Label-Encoding.

In [4]:
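# Map each speaker folder name to an integer class index.
# Note: the name `dict` shadows Python's built-in type; it is kept because later cells refer to it.
dict={}
for i, val in enumerate(folders):
    dict[val]=i

dict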

{'aew': 0,
 'ahw': 1,
 'aup': 2,
 'awb': 3,
 'axb': 4,
 'bdl': 5,
 'Chaitanya': 6,
 'clb': 7,
 'eey': 8,
 'fem': 9,
 'gka': 10,
 'Harsh': 11,
 'Himanshu': 12,
 'jmk': 13,
 'Krupesh': 14,
 'ksp': 15,
 'ljm': 16,
 'lnh': 17,
 'Natvar': 18,
 'rms': 19,
 'rxr': 20,
 'Shashank': 21,
 'slp': 22,
 'slt': 23,
 'Takshay': 24}
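
The mapping above depends on the order in which `os.listdir` returns the folders, which is not guaranteed to be the same on every machine. A small sketch of an order-independent encoding, together with the inverse mapping needed at prediction time, is shown below. The variable names and the four-speaker subset are illustrative assumptions.

```python
# A few folder names taken from the listing above
speakers = ["bdl", "Harsh", "aew", "slt"]

# Sort case-insensitively so the index of each speaker is stable across machines
label_to_index = {name: i for i, name in enumerate(sorted(speakers, key=str.lower))}

# Inverse mapping: class index back to speaker name
index_to_label = {i: name for name, i in label_to_index.items()}

print(label_to_index)
print(index_to_label[2])
```

Keeping the inverse dictionary avoids scanning the forward mapping every time a prediction has to be turned back into a name.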

#### Labelling y_train and y_test.

In [5]:
# Replace each speaker name with its integer class index.
y_train=[dict[label] for label in y_train]
y_test=[dict[label] for label in y_test]

#### As we were trying to differentiate people based on the spectrogram images of their voices, we thought that a Convolutional Neural Network would be the best choice for an image dataset. The following model has two convolutional layers, each followed by a ReLU layer and a max-pooling layer.

Convolutional layers reuse the same small filters across the whole image, which keeps the number of parameters and the amount of computation low. Max-pooling layers downsample the feature maps while keeping the dominant features, which are important for detecting the frequency differences between the voices of two people.

ReLU layers introduce non-linearity, which the network needs in order to model complex functions.

In [6]:
import tensorflow.keras.layers as tfl

def convolutional_model(input_shape):
    input_img = tf.keras.Input(shape=input_shape)
    Z1=tfl.Conv2D(filters=8,kernel_size=(4,4),strides=(1,1),padding='same')(input_img)
    A1=tfl.ReLU()(Z1)
    P1=tfl.MaxPool2D(pool_size=(8,8),strides=(8,8),padding='same')(A1)
    Z2=tfl.Conv2D(filters=16,kernel_size=(2,2),strides=(1,1),padding='same')(P1)
    A2=tfl.ReLU()(Z2)
    P2=tfl.MaxPool2D(pool_size=(4,4),strides=(4,4),padding='same')(A2)
    F=tfl.Flatten()(P2)
    outputs=tfl.Dense(25,activation='softmax')(F)
    
    model = tf.keras.Model(inputs=input_img, outputs=outputs)
    return model

#### As the dependent variable is label-encoded as integers (not one-hot encoded), the clear choice of loss function was sparse categorical crossentropy.

In [7]:
conv_model = convolutional_model((200, 200, 4))
conv_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
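
For intuition, sparse categorical crossentropy is the negative log of the probability that the model assigns to the true class, where the true class is given as an integer index. The NumPy sketch below illustrates the idea on made-up probabilities for a three-class toy problem; it is not the Keras implementation, and the function name and the clipping constant are assumptions.

```python
import numpy as np

def sparse_cce(y_true, probs, eps=1e-7):
    """Per-sample loss: -log(probability of the true class)."""
    probs = np.clip(probs, eps, 1.0)          # avoid log(0)
    return -np.log(probs[np.arange(len(y_true)), y_true])

# Toy softmax outputs for two samples and three classes (illustrative values only)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
y_true = np.array([0, 2])                     # integer labels, as in y_train

losses = sparse_cce(y_true, probs)
print(losses, losses.mean())
```

With one-hot encoded labels the equivalent loss would be `categorical_crossentropy`; the sparse variant simply saves the conversion.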

#### We train with mini-batches of 64 samples to speed up training, use our test dataset as the validation dataset, and train the model for 100 epochs.

In [8]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(64)
history = conv_model.fit(train_dataset, epochs=100, validation_data=test_dataset)

Epoch 1/100
...
Epoch 78
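
One detail of the mini-batch setup above: the images were loaded folder by folder, and `batch(64)` keeps that order, so most batches contain samples from only one or two speakers. Shuffling before batching is a common remedy (with `tf.data` this would be a `.shuffle(...)` call before `.batch(64)`). The NumPy sketch below shows the idea of shuffled mini-batches; the helper name and the sample count of 1000 are assumptions for illustration.

```python
import numpy as np

def minibatch_indices(n_samples, batch_size=64, seed=0):
    """Split a random permutation of sample indices into mini-batches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return [order[start:start + batch_size]
            for start in range(0, n_samples, batch_size)]

batches = minibatch_indices(1000)
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each index appears in exactly one batch, and the last batch is smaller whenever the sample count is not a multiple of the batch size.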

## Observations: 
We observed that the model's training accuracy is close to its validation accuracy, which suggests that the model is not overfitting and should generalize reasonably well to unseen audio.


### We are preprocessing the new input sample before feeding it to the model.

In [30]:
image=Image.open(str(r"C:\Users\Shashank\Documents\Tic-Tech-Toe-2023\temp\images\Himanshu\output_audio_0.png"))
new_img=image.resize((200,200))
NewImageTaken=np.array(new_img)/255.
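
The model was built for inputs of shape `(200, 200, 4)`, so a new image must have four channels as well. The sketch below wraps the preprocessing in a helper that forces RGBA before resizing. The function name `preprocess` and the explicit `convert("RGBA")` call are assumptions added for robustness, not part of the original cell.

```python
import numpy as np
from PIL import Image

def preprocess(image, size=(200, 200)):
    """Resize to the model's input size and scale pixel values to [0, 1]."""
    rgba = image.convert("RGBA").resize(size)   # guarantee 4 channels
    return np.asarray(rgba, dtype=np.float32) / 255.0

# Stand-in for a spectrogram PNG: a small in-memory RGB image
demo = Image.new("RGB", (64, 48), color=(255, 0, 0))
x = preprocess(demo)
print(x.shape, x.min(), x.max())
```

Using the same helper for both training images and new samples keeps the two preprocessing paths from drifting apart.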

### Instead of training the model every time we need to predict on new audio, we create a pickle file that stores the model architecture and weights for future use.

In [31]:
filename = 'conv_model.sav'
# Use a context manager so the file is closed (and flushed) after writing.
with open(filename, 'wb') as file:
    pickle.dump(conv_model, file)


### Loading the pickle file :

In [32]:
with open(filename, 'rb') as file:
    load_model = pickle.load(file)

### We expand the dimensions of the image to add a batch axis, which makes it compatible with the model's input shape.

In [33]:
y_predicted = load_model.predict(tf.expand_dims(NewImageTaken,axis=0))
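
The model expects a batch of images, that is, an array of shape `(batch, 200, 200, 4)`, while a single preprocessed image has shape `(200, 200, 4)`. Adding a leading axis of length 1 turns one image into a batch of one. The NumPy equivalent of `tf.expand_dims` is sketched below on a dummy array of zeros.

```python
import numpy as np

single_image = np.zeros((200, 200, 4), dtype=np.float32)   # dummy stand-in for NewImageTaken

batch_of_one = np.expand_dims(single_image, axis=0)        # same effect as single_image[np.newaxis]
print(single_image.shape, "->", batch_of_one.shape)
```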



In [34]:
y_predicted

array([[1.1872256e-06, 4.1412035e-20, 6.3200451e-07, 1.3269940e-07,
        5.3057758e-14, 4.5136471e-08, 8.5392458e-05, 5.9769088e-19,
        1.8380447e-08, 1.5651175e-11, 2.0856106e-13, 5.3420980e-03,
        9.5312518e-01, 7.2516121e-21, 4.1396145e-02, 2.5276115e-09,
        2.2915698e-09, 2.4920964e-05, 1.7026417e-05, 6.4641296e-20,
        2.9011366e-11, 7.2174117e-08, 4.2623812e-15, 1.7539805e-09,
        7.1127365e-06]], dtype=float32)

### Each element of the predicted vector represents the likelihood that the input audio belongs to the corresponding label. We select the element with the maximum likelihood and map its index back to the corresponding dictionary label.

In [35]:
max_element = float('-inf')
max_element_index = None

for i, sublist in enumerate(y_predicted):
    for j, element in enumerate(sublist):
        if element > max_element:
            max_element = element
            max_element_index = (i, j)

print(f"The maximum element is {max_element} at index {max_element_index}")

The maximum element is 0.9531251788139343 at index (0, 12)


In [36]:
# Look up the speaker name whose class index matches the predicted index.
result_key = None
for key, value in dict.items():
    if value == max_element_index[1]:
        result_key = key
        break

print(result_key)

Himanshu
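
The two lookup steps above can be condensed with `np.argmax`. For a lock it may also make sense to refuse to unlock when the top probability is low. The sketch below reuses the label order and the prediction vector printed earlier; the function name `identify` and the threshold of 0.9 are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

labels = ['aew', 'ahw', 'aup', 'awb', 'axb', 'bdl', 'Chaitanya', 'clb', 'eey',
          'fem', 'gka', 'Harsh', 'Himanshu', 'jmk', 'Krupesh', 'ksp', 'ljm',
          'lnh', 'Natvar', 'rms', 'rxr', 'Shashank', 'slp', 'slt', 'Takshay']

# Prediction vector copied from the output of cell 34
probs = np.array([
    1.1872256e-06, 4.1412035e-20, 6.3200451e-07, 1.3269940e-07,
    5.3057758e-14, 4.5136471e-08, 8.5392458e-05, 5.9769088e-19,
    1.8380447e-08, 1.5651175e-11, 2.0856106e-13, 5.3420980e-03,
    9.5312518e-01, 7.2516121e-21, 4.1396145e-02, 2.5276115e-09,
    2.2915698e-09, 2.4920964e-05, 1.7026417e-05, 6.4641296e-20,
    2.9011366e-11, 7.2174117e-08, 4.2623812e-15, 1.7539805e-09,
    7.1127365e-06])

def identify(probs, labels, threshold=0.9):
    """Return the most likely speaker, or None if the model is not confident enough."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return labels[best]

print(identify(probs, labels))
```

Returning `None` below the threshold gives the application a clear signal to keep the profile locked instead of opening the closest match.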
