# 1. Data Preparation and Feature Extraction

- First, install and load the dependencies.

- Make sure you have a dataset with audio samples for each speaker.

- Define a feature extraction function that extracts relevant audio features from each recording. Here, the `extract_feature` function averages a Mel spectrogram over time to produce a fixed-length feature vector, which is suitable for speaker identification.

## 1.1 Install Our Dependencies

In [None]:
!pip install tensorflow tensorflow-io matplotlib librosa scikit-learn

## 1.2 Load Dependencies

In [1]:
import os
import io
import numpy as np
import librosa
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import matplotlib.pyplot as plt
from scipy.io.wavfile import write, read as wav_read

## 1.3 Let's Get Our Data In!

In [None]:
from google.colab import files
files.upload()  # This will prompt you to upload the kaggle.json file

os.environ['KAGGLE_CONFIG_DIR'] = "/content/"

!kaggle datasets download -d mfekadu/english-multispeaker-corpus-for-voice-cloning

!unzip -qq english-multispeaker-corpus-for-voice-cloning.zip

## 1.4 Feature Extraction Function

In [3]:
# Function to extract features from an audio file
def extract_feature(file_name):
    """Extract a fixed-length feature vector from an audio file.

    Args:
      file_name (str): Path to the audio file.

    Returns:
      np.ndarray: Mean Mel-spectrogram vector (128 values with librosa's default n_mels).
    """
    X, sample_rate = librosa.load(file_name)  # load audio (resampled to 22,050 Hz by default)
    # The Mel spectrogram has shape (n_mels, frames); average over the time frames
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate), axis=1)
    return mel  # feature vector of length n_mels
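
To see why averaging over time yields a vector of the same length for every clip, here is a minimal sketch in plain Python. The helper `mean_over_time` and the toy 3-band "spectrograms" are illustrative assumptions, not part of the notebook; a real Mel spectrogram from librosa has 128 bands by default.

```python
def mean_over_time(spectrogram):
    """Average each frequency band (row) across its time frames (columns)."""
    return [sum(band) / len(band) for band in spectrogram]

# Two made-up "spectrograms": 3 bands each, but different numbers of frames
short_clip = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
long_clip = [[1.0, 2.0, 3.0, 2.0], [5.0, 5.0, 5.0, 5.0], [0.0, 0.0, 4.0, 4.0]]

short_vec = mean_over_time(short_clip)
long_vec = mean_over_time(long_clip)

# Both vectors have one value per band, regardless of clip length
print(short_vec, long_vec, len(short_vec) == len(long_vec))
```

The trade-off is that the time ordering of the frames is discarded, which is acceptable for a simple dense network but loses information a sequence model could use.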

## 1.5 Create a Variable for the Data Root

In [4]:
# Path to the WAV files (the transcript .txt files are not used)
DATA_ROOT = '/content/VCTK-Corpus/VCTK-Corpus/wav48'

# 2. Speaker Subset Selection and Preprocessing

- Define a subset of speakers you want to use for training and testing

- Create a function that will process audio files for the subset of speakers

In [31]:
# Define the target speakers in a list
target_speakers = [
    "p225", # low tone female
    "p228", # medium tone female
    "p236", # high tone female
    "p249", # low tone female
    "p257", # medium tone female
    "p226", # medium tone male
    "p237", # low tone male
    "p241", # medium tone male
    "p304", # low tone male
    "p326"  # low tone male
]

# Note: the nested loops below visit each WAV file exactly once, so the runtime grows linearly with the total number of files (not O(N^2)).

# Function to process audio files for target speakers
def process_audio_files(data_directory):
    features = []
    labels = []

    for speaker_id in target_speakers:
        speaker_dir = os.path.join(data_directory, speaker_id)
        wav_files = sorted([f for f in os.listdir(speaker_dir) if f.endswith(".wav")])

        # Hold out the last 5 files of each speaker as unseen test data
        for file_name in wav_files[:-5]:
            file_path = os.path.join(speaker_dir, file_name)
            feature = extract_feature(file_path)
            features.append(feature)
            labels.append(speaker_id)
        print(f'Last 5 for {speaker_id}: {wav_files[-5:]}')
    return features, labels

# Grab features and labels from DATA_ROOT
features, labels = process_audio_files(DATA_ROOT)

Last 5 for p225: ['p225_358.wav', 'p225_359.wav', 'p225_363.wav', 'p225_365.wav', 'p225_366.wav']
Last 5 for p226: ['p226_366.wav', 'p226_367.wav', 'p226_368.wav', 'p226_369.wav', 'p226_370.wav']
Last 5 for p228: ['p228_367.wav', 'p228_368.wav', 'p228_369.wav', 'p228_370.wav', 'p228_371.wav']
Last 5 for p236: ['p236_499.wav', 'p236_500.wav', 'p236_501.wav', 'p236_502.wav', 'p236_503.wav']
Last 5 for p237: ['p237_347.wav', 'p237_348.wav', 'p237_349.wav', 'p237_350.wav', 'p237_351.wav']
Last 5 for p241: ['p241_370.wav', 'p241_371.wav', 'p241_372.wav', 'p241_373.wav', 'p241_374.wav']
Last 5 for p249: ['p249_350.wav', 'p249_351.wav', 'p249_352.wav', 'p249_353.wav', 'p249_354.wav']
Last 5 for p257: ['p257_430.wav', 'p257_431.wav', 'p257_432.wav', 'p257_433.wav', 'p257_434.wav']
Last 5 for p304: ['p304_420.wav', 'p304_421.wav', 'p304_422.wav', 'p304_423.wav', 'p304_424.wav']
Last 5 for p326: ['p326_396.wav', 'p326_397.wav', 'p326_398.wav', 'p326_399.wav', 'p326_400.wav']
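
The hold-out step relies on sorting the file names and slicing off the tail. A minimal sketch of that idea, assuming a hypothetical helper `split_holdout` and made-up file names:

```python
def split_holdout(file_names, holdout=5):
    """Sort file names and split off the last `holdout` entries as unseen data."""
    ordered = sorted(file_names)
    if holdout <= 0:
        # ordered[:-0] would be empty, so handle this case explicitly
        return ordered, []
    return ordered[:-holdout], ordered[-holdout:]

# 12 made-up file names for one speaker
files = [f"p225_{i:03d}.wav" for i in range(1, 13)]

train_files, unseen_files = split_holdout(files)
print(len(train_files), unseen_files)
```

Sorting first matters: `os.listdir` returns names in arbitrary order, so without it the "last 5" files could differ between runs.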


- Let's look at an example of what the features and labels look like. :D

In [47]:
# print(features[0])  # Print the first feature vector
print(labels[0])    # Print the corresponding label

p225


- Each feature vector (128 values) comes from just one WAV file. Remember:

- Giant array = features
- String "p225" = label

# 3. Build Model And Train Model

- Instead of a binary classification model, speaker classification uses a multiclass model in which each class represents a unique speaker.

- The output layer must have one neuron per speaker in the subset, so 10 here, with a softmax activation for multiclass classification.
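
As a rough illustration of what the softmax output layer does, the sketch below turns raw scores into probabilities and picks the most likely speaker. The scores and the shortened speaker list are made up for the example.

```python
import math

def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

speakers = ["p225", "p226", "p228"]  # shortened list, for illustration only
logits = [2.0, 0.5, -1.0]            # made-up output-layer scores

probs = softmax(logits)
predicted = speakers[probs.index(max(probs))]  # same idea as np.argmax
print(probs, predicted)
```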

## 3.1 Model Architecture

In [33]:
# One model for both genders, if its predictions are good enough for the final product!

def create_speaker_model(vector_length=128, num_speakers=10):
    model = Sequential([
        Dense(256, input_shape=(vector_length,), activation='relu'), # 256 neurons, ReLU activation

        Dropout(0.3), # randomly turn off 30% of neurons during training to reduce overfitting

        Dense(256, activation='relu'),
        Dropout(0.3),
        Dense(128, activation='relu'), # 128 neurons
        Dropout(0.3),
        Dense(num_speakers, activation='softmax')  # one output per speaker, softmax for multiclass classification
    ])
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    return model
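
For a sense of the model's size, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add none. The sketch below assumes the 128-value input and 10 speakers used in this notebook; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# input size followed by the width of each Dense layer
layer_sizes = [128, 256, 256, 128, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```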


- Categorical Crossentropy: Measures the difference between the predicted class probabilities and the one-hot true labels in multiclass tasks.

- Adam Optimizer: Adapts the step size of each weight during training, which usually speeds up convergence.

- Neurons: Basic computational units; each one computes a weighted sum of its inputs plus a bias.

- ReLU: Activation that converts negative values to zero and leaves positive values unchanged.

- Example of a Neuron (from GPT):

"""
A neuron in a neural network is a computational unit that takes in one or more inputs, applies weights, adds a bias, and then passes the result through an activation function to produce an output.

Imagine a neuron designed to predict whether an email is spam or not based on just two features: the number of suspicious keywords and the presence of a link.
"""

- In our case, the activation function is ReLU.
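
The spam example can be made concrete with a small sketch. The weights, bias, and inputs below are made-up numbers chosen only to show the arithmetic of a single ReLU neuron.

```python
def relu(x):
    """ReLU: negative values become zero, positive values pass through."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(z)

# Two features: number of suspicious keywords, and whether a link is present (1 or 0)
weights = [0.5, 1.0]  # made-up weights
bias = -1.0           # made-up bias

active = neuron([3, 1], weights, bias)    # 3*0.5 + 1*1.0 - 1.0
inactive = neuron([0, 0], weights, bias)  # the weighted sum is negative, so ReLU clips it
print(active, inactive)
```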

## 3.2 Train Test Split & One Hot Encoding

In [34]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

np_features = np.array(features)
np_labels = np.array(labels)

# Split the data into training and test sets (the test set doubles as validation data during training)
X_train, X_test, y_train, y_test = train_test_split(np_features, np_labels, test_size=0.2, random_state=42)

encoder = OneHotEncoder(sparse_output=False,
                        handle_unknown='ignore') # sparse_output=False returns a dense array

encoder.fit(y_train.reshape(-1, 1)) # Fit on training labels, reshaped for 2D input

y_train_encoded = encoder.transform(y_train.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.reshape(-1, 1))

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train_encoded.shape)
print("X_val shape:", X_test.shape)
print("y_val shape:", y_test_encoded.shape)

X_train shape: (2942, 128)
y_train shape: (2942, 10)
X_val shape: (736, 128)
y_val shape: (736, 10)
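
The shapes above can be checked by hand. The sketch below assumes that the test share is rounded up when `test_size` is a fraction, which matches the numbers printed here; `split_sizes` and `steps_per_epoch` are hypothetical helpers, not scikit-learn or Keras functions.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

def steps_per_epoch(n_rows, batch_size=32):
    """Number of batches needed to cover n_rows (the last batch may be smaller)."""
    return math.ceil(n_rows / batch_size)

total_files = 2942 + 736  # training rows + test rows from the output above
n_train, n_test = split_sizes(total_files)

print(n_train, n_test)
print(steps_per_epoch(n_train), steps_per_epoch(n_test))
```

With a batch size of 32, the 736 test rows take 23 batches, which agrees with the `23/23` progress counter shown when predicting on the test set.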


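- Before one-hot encoding, the labels are strings such as "p225", which the network cannot train on directly. Mapping them to plain integers instead would suggest an ordering between speakers that does not exist.

- After one-hot encoding, each speaker gets its own output column, so the model can learn the relationship between audio features and speaker identities! :D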
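
To make the encoding concrete, here is a minimal plain-Python sketch of what one-hot encoding does to the speaker labels. The helpers `fit_categories`, `one_hot`, and `decode` are hypothetical stand-ins written for this example; they are not scikit-learn's implementation.

```python
def fit_categories(labels):
    """Collect the distinct labels in sorted order."""
    return sorted(set(labels))

def one_hot(labels, categories):
    """Encode each label as a row with a single 1.0 in its category's column."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0.0] * len(categories)
        if label in index:  # unknown labels stay all-zero, similar to handle_unknown='ignore'
            row[index[label]] = 1.0
        rows.append(row)
    return rows

def decode(row, categories):
    """Map a one-hot (or probability) row back to its label via the largest entry."""
    return categories[row.index(max(row))]

train_labels = ["p226", "p225", "p228", "p225"]
categories = fit_categories(train_labels)
encoded = one_hot(train_labels, categories)

print(categories)
print(encoded[0], decode(encoded[0], categories))
```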

## 3.3 Initialize and View Summary of Model

In [48]:
# Initialize the model
speaker_model = create_speaker_model(vector_length=X_train.shape[1], num_speakers=y_train_encoded.shape[1])

speaker_model.summary()  # summary() prints the table itself and returns None, so no print() is needed


## 3.4 Train the MLP (Multi-Layer Perceptron) Model

In [None]:
# Train the model
speaker_model.fit(X_train, y_train_encoded, validation_data=(X_test, y_test_encoded), epochs=30, batch_size=32)

# 4. Evaluate The Model

## 4.1 Make Predictions

In [50]:
y_pred_probs = speaker_model.predict(X_test)  # Get predicted probabilities
y_pred = np.argmax(y_pred_probs, axis=1)      # Convert to class predictions


23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step


## 4.2 Import And Calculate Metrics

In [51]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

y_true = np.argmax(y_test_encoded, axis=1)  # convert one-hot rows back to class indices

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')  # For multiclass
recall = recall_score(y_true, y_pred, average='weighted')        # For multiclass
f1 = f1_score(y_true, y_pred, average='weighted')                # For multiclass

- Accuracy: How often the model predicts the correct speaker.

- Precision: Of the samples predicted as a given speaker, how many truly belong to that speaker.

- Recall: Of the actual samples from a given speaker, how many the model identifies correctly.

- F1-Score: The harmonic mean of precision and recall, particularly valuable if there is an imbalance in speaker samples.
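
A small worked example can show how the weighted averages are formed: each speaker's score is weighted by that speaker's share of the true labels. The labels below are made up, and `per_class_scores` and `weighted_average` are hypothetical helpers written for this sketch.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def weighted_average(y_true, y_pred, metric_index):
    """Average a per-class metric, weighting each class by its share of y_true."""
    total = len(y_true)
    return sum(
        y_true.count(label) / total * per_class_scores(y_true, y_pred, label)[metric_index]
        for label in sorted(set(y_true))
    )

# Made-up labels: one p225 sample is mistaken for p226
y_true_demo = ["p225", "p225", "p225", "p226", "p226", "p228"]
y_pred_demo = ["p225", "p225", "p226", "p226", "p226", "p228"]

w_precision = weighted_average(y_true_demo, y_pred_demo, 0)
w_recall = weighted_average(y_true_demo, y_pred_demo, 1)
accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(w_precision, w_recall, accuracy_demo)
```

In this toy case the weighted recall equals the plain accuracy, which is a general property of support-weighted recall.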

In [52]:
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

Accuracy: 0.9538043478260869
Precision: 0.9543011808554899
Recall: 0.9538043478260869
F1-score: 0.9533499422908044


- The results look a bit too good to be true... >:/

## 4.3 Classification Report And Confusion Matrix

In [53]:
print(classification_report(np.argmax(y_test_encoded, axis=1), y_pred, target_names=encoder.categories_[0]))  # Report with speaker IDs
cm = confusion_matrix(np.argmax(y_test_encoded, axis=1), y_pred)
print("Confusion Matrix:")
print(cm)

              precision    recall  f1-score   support

        p225       1.00      0.89      0.94        55
        p226       0.95      0.95      0.95        88
        p228       0.92      0.82      0.86        66
        p236       0.96      1.00      0.98        94
        p237       1.00      0.98      0.99        65
        p241       0.96      0.97      0.96        68
        p249       0.93      1.00      0.96        62
        p257       0.88      0.91      0.89        79
        p304       0.98      0.98      0.98        84
        p326       0.99      1.00      0.99        75

    accuracy                           0.95       736
   macro avg       0.96      0.95      0.95       736
weighted avg       0.95      0.95      0.95       736

Confusion Matrix:
[[49  0  1  1  0  0  2  1  0  1]
 [ 0 84  0  0  0  3  0  0  1  0]
 [ 0  0 54  0  0  0  3  9  0  0]
 [ 0  0  0 94  0  0  0  0  0  0]
 [ 0  1  0  0 64  0  0  0  0  0]
 [ 0  2  0  0  0 66  0  0  0  0]
 [ 0  0  0  0  0  0 62  0
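
Each row of the confusion matrix belongs to one true speaker, so recall is the diagonal entry divided by the row sum. The sketch below recomputes recall for the six complete rows printed above; the row-to-speaker mapping assumes the sorted label order shown in the classification report.

```python
speakers = ["p225", "p226", "p228", "p236", "p237", "p241"]

# The first six rows of the confusion matrix, copied from the output above
rows = [
    [49, 0, 1, 1, 0, 0, 2, 1, 0, 1],
    [0, 84, 0, 0, 0, 3, 0, 0, 1, 0],
    [0, 0, 54, 0, 0, 0, 3, 9, 0, 0],
    [0, 0, 0, 94, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 64, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 66, 0, 0, 0, 0],
]

# Recall = correct predictions (diagonal) / all true samples of that speaker (row sum)
recalls = {
    speaker: round(row[i] / sum(row), 2)
    for i, (speaker, row) in enumerate(zip(speakers, rows))
}
print(recalls)
```

The row for p228 also shows where its errors go: 9 of its 66 samples land in column index 7, which corresponds to p257 in the sorted label order.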

Hmm... let's test it on the held-out files. >:o

## 4.4 Test It Yourself!

- Used GPT to generate this amazing test function for me.

In [54]:
import numpy as np
from sklearn.metrics import accuracy_score

def test_speaker_model(model, encoder, data_root, test_cases):
    """
    Tests the speaker model using provided test cases.

    Args:
        model: The trained speaker model.
        encoder: The OneHotEncoder used for label encoding.
        data_root: The path to the dataset root directory.
        test_cases: A dictionary mapping speaker IDs to file names or lists of file names.

    Returns:
        float: The accuracy of the model on the test cases.
    """
    actual_labels = []
    predicted_labels = []

    for speaker_id, file_names in test_cases.items():
        # Handle a single file name or a list of file names
        if isinstance(file_names, str):
            file_names = [file_names]  # Convert a single file name to a list

        for file_name in file_names:
            # Build the file path for this speaker's file
            file_path = os.path.join(data_root, speaker_id, file_name + ".wav")

            # Extract its features
            feature = extract_feature(file_path)

            # Reshape to (1, n_features) so it matches the model's input shape
            feature = feature.reshape(1, -1)

            prediction_probs = model.predict(feature)  # per-speaker probabilities
            predicted_speaker_index = np.argmax(prediction_probs)  # index of the most likely speaker
            predicted_speaker_id = encoder.categories_[0][predicted_speaker_index]

            actual_labels.append(speaker_id)  # true label
            predicted_labels.append(predicted_speaker_id)  # predicted label

    accuracy = accuracy_score(actual_labels, predicted_labels)  # fraction of correct predictions
    return accuracy

# Define your test cases
test_cases = {
    "p225": ["p225_358", "p225_359", "p225_363", "p225_365", "p225_366"],
    "p226": ["p226_366", "p226_367", "p226_368", "p226_369", "p226_370"],
    "p228": ["p228_367", "p228_368", "p228_369", "p228_370", "p228_371"],
    "p236": ["p236_499", "p236_500", "p236_501", "p236_502", "p236_503"],
    "p237": ["p237_347", "p237_348", "p237_349", "p237_350", "p237_351"],
    "p241": ["p241_370", "p241_371", "p241_372", "p241_373", "p241_374"],
    "p249": ["p249_350", "p249_351", "p249_352", "p249_353", "p249_354"],
    "p257": ["p257_430", "p257_431", "p257_432", "p257_433", "p257_434"],
    "p304": ["p304_420", "p304_421", "p304_422", "p304_423", "p304_424"],
    "p326": ["p326_396", "p326_397", "p326_398", "p326_399", "p326_400"]
}

# Run the test and print the accuracy
accuracy = test_speaker_model(speaker_model, encoder, DATA_ROOT, test_cases)  # Using your trained model and encoder
print(f"Test Accuracy: {accuracy}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 137ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4

- I am so shocked at how well this model is doing compared to before. It was worth investing time into this!

# 5. Save Model! :D

- Now we can use the model in streamlit, gradio, or wherever.

In [56]:
speaker_model.save('CUR_speaker_model.h5')
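
The saved model only outputs probabilities by column index, so an app that loads it also needs the speaker order held in `encoder.categories_[0]`. One possible approach, sketched here under assumptions, is to store that order in a small JSON file next to the model. The file name `speaker_labels.json` and both helper functions are hypothetical.

```python
import json
import os
import tempfile

def save_labels(categories, path):
    """Write the ordered speaker IDs to a JSON file."""
    with open(path, "w") as fh:
        json.dump(list(categories), fh)

def load_labels(path):
    """Read the ordered speaker IDs back."""
    with open(path) as fh:
        return json.load(fh)

# In the notebook this list would come from encoder.categories_[0].tolist()
speaker_order = ["p225", "p226", "p228", "p236", "p237",
                 "p241", "p249", "p257", "p304", "p326"]

labels_path = os.path.join(tempfile.gettempdir(), "speaker_labels.json")  # hypothetical location
save_labels(speaker_order, labels_path)
restored = load_labels(labels_path)

predicted_index = 7  # made-up argmax result
print(restored[predicted_index])
```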



# Continuing: Male/Female Models

- This most likely will not be used, since the model I am currently using performs really well, but I'll add it here just in case...

In [None]:
# For Male Model
Male_speakers = [
    "p226", # medium tone male
    "p237", # low tone male
    "p241", # medium tone male
    "p304", # low tone male
    "p326"  # low tone male
]

# Function to process audio files for target male speakers
def m_process_audio_files(data_directory):
    features = []
    labels = []

    # lets iterate shall we? :D
    for speaker_id in Male_speakers:

        # get path associated to speaker_id
        speaker_dir = os.path.join(data_directory, speaker_id)

        # iterate again? >:D  (btw we are iterating through the individual speaker folders)
        for file_name in os.listdir(speaker_dir):

            # just wanna make sure we get only WAV
            if file_name.endswith(".wav"):

                # get file path associated to speaker
                file_path = os.path.join(speaker_dir, file_name)

                # extract features from speakers file_path (of speaker)
                feature = extract_feature(file_path)

                # append the features to list
                features.append(feature)

                # append label to list (ex: p226)
                labels.append(speaker_id)

    return features, labels

# Grab features and labels from DATA_ROOT
m_features, m_labels = m_process_audio_files(DATA_ROOT)

In [None]:
# For Female Model

Female_speakers = [
    "p225", # low tone female
    "p228", # medium tone female
    "p236", # high tone female
    "p249", # low tone female
    "p257", # medium tone female
]

# Function to process audio files for target female speakers
def f_process_audio_files(data_directory):
    features = []
    labels = []

    # lets iterate shall we? :D
    for speaker_id in Female_speakers:

        # get path associated to speaker_id
        speaker_dir = os.path.join(data_directory, speaker_id)

        # iterate again? >:D  (btw we are iterating through the individual speaker folders)
        for file_name in os.listdir(speaker_dir):

            # just wanna make sure we get only WAV
            if file_name.endswith(".wav"):

                # get file path associated to speaker
                file_path = os.path.join(speaker_dir, file_name)

                # extract features from speakers file_path (of speaker)
                feature = extract_feature(file_path)

                # append the features to list
                features.append(feature)

                # append label to list (ex: p225)
                labels.append(speaker_id)

    return features, labels

# Grab features and labels from DATA_ROOT
f_features, f_labels = f_process_audio_files(DATA_ROOT)
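
The male and female loaders differ only in the speaker list, so they could be folded into one function that takes the list as an argument. The sketch below is one way to do that; `collect_features` is a hypothetical name, and the demo uses a throwaway directory with `os.path.basename` standing in for `extract_feature` so it runs without audio files.

```python
import os
import tempfile

def collect_features(data_directory, speakers, extractor):
    """Apply `extractor` to every .wav file of each listed speaker."""
    features, labels = [], []
    for speaker_id in speakers:
        speaker_dir = os.path.join(data_directory, speaker_id)
        for file_name in sorted(os.listdir(speaker_dir)):  # sorted for a reproducible order
            if file_name.endswith(".wav"):
                features.append(extractor(os.path.join(speaker_dir, file_name)))
                labels.append(speaker_id)
    return features, labels

# Demo on made-up, empty files; a real call would pass extract_feature instead
with tempfile.TemporaryDirectory() as root:
    layout = {"p225": ["a.wav", "notes.txt"], "p226": ["b.wav", "c.wav"]}
    for speaker_id, names in layout.items():
        os.makedirs(os.path.join(root, speaker_id))
        for name in names:
            open(os.path.join(root, speaker_id, name), "w").close()
    demo_features, demo_labels = collect_features(root, ["p225", "p226"], os.path.basename)

print(demo_features, demo_labels)
```

In the notebook, the equivalent calls would look like `collect_features(DATA_ROOT, Male_speakers, extract_feature)` and the same with `Female_speakers`.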

- For the females

In [None]:
# print(f_features[0])  # Print the first feature vector
print(f_labels[0])    # Print the corresponding label

p225


- For the males

In [None]:
# print(m_features[0])  # Print the first feature vector
print(m_labels[0])    # Print the corresponding label

p226


In [None]:
def M_F_create_speaker_model(vector_length=128, num_speakers=5):
    model = Sequential([
        Dense(256, input_shape=(vector_length,), activation='relu'), # 256 neurons, ReLU activation

        Dropout(0.3), # randomly turn off 30% of neurons during training to reduce overfitting

        Dense(256, activation='relu'),
        Dropout(0.3),
        Dense(128, activation='relu'), # 128 neurons
        Dropout(0.3),
        Dense(num_speakers, activation='softmax')  # For multiclass classification ('softmax')
    ])
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    return model

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

f_np_features = np.array(f_features)
f_np_labels = np.array(f_labels)

# Split the data into training and test sets (the test set doubles as validation data during training)
f_X_train, f_X_test, f_y_train, f_y_test = train_test_split(f_np_features, f_np_labels, test_size=0.2, random_state=42)

f_encoder = OneHotEncoder(sparse_output=False,
                        handle_unknown='ignore') # sparse_output=False returns a dense array

f_encoder.fit(f_y_train.reshape(-1, 1)) # Fit on training labels, reshaped for 2D input

f_y_train_encoded = f_encoder.transform(f_y_train.reshape(-1, 1))
f_y_test_encoded = f_encoder.transform(f_y_test.reshape(-1, 1))

print("Female X_train shape:", f_X_train.shape)
print("Female y_train shape:", f_y_train_encoded.shape)
print("Female X_val shape:", f_X_test.shape)
print("Female y_val shape:", f_y_test_encoded.shape)

Female X_train shape: (1486, 128)
Female y_train shape: (1486, 5)
Female X_val shape: (372, 128)
Female y_val shape: (372, 5)


In [None]:
# Initialize the model
female_speaker_model = M_F_create_speaker_model(vector_length=f_X_train.shape[1], num_speakers=f_y_train_encoded.shape[1])

female_speaker_model.summary()  # summary() prints the table itself and returns None, so no print() is needed


In [None]:
# Train the female model
female_speaker_model.fit(f_X_train, f_y_train_encoded, validation_data=(f_X_test, f_y_test_encoded), epochs=30, batch_size=32)

Epoch 1/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 6s 53ms/step - accuracy: 0.4504 - loss: 1.5939 - val_accuracy: 0.7796 - val_loss: 0.5938
Epoch 2/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7447 - loss: 0.6864 - val_accuracy: 0.8575 - val_loss: 0.4203
Epoch 3/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8022 - loss: 0.5287 - val_accuracy: 0.8656 - val_loss: 0.3639
Epoch 4/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8131 - loss: 0.4900 - val_accuracy: 0.8925 - val_loss: 0.3024
Epoch 5/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8290 - loss: 0.4025 - val_accuracy: 0.9005 - val_loss: 0.2718
Epoch 6/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8560 - loss: 0.3919 - val_accuracy: 0.9140 - val_loss: 0.2207
Epoch 7/30
47/47 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x7c0bd894d7b0>

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

m_np_features = np.array(m_features)
m_np_labels = np.array(m_labels)

# Split the data into training and test sets (the test set doubles as validation data during training)
m_X_train, m_X_test, m_y_train, m_y_test = train_test_split(m_np_features, m_np_labels, test_size=0.2, random_state=42)

m_encoder = OneHotEncoder(sparse_output=False,
                        handle_unknown='ignore') # sparse_output=False returns a dense array

m_encoder.fit(m_y_train.reshape(-1, 1)) # Fit on training labels, reshaped for 2D input

m_y_train_encoded = m_encoder.transform(m_y_train.reshape(-1, 1))
m_y_test_encoded = m_encoder.transform(m_y_test.reshape(-1, 1))

print("Male X_train shape:", m_X_train.shape)
print("Male y_train shape:", m_y_train_encoded.shape)
print("Male X_val shape:", m_X_test.shape)
print("Male y_val shape:", m_y_test_encoded.shape)

Male X_train shape: (1496, 128)
Male y_train shape: (1496, 5)
Male X_val shape: (374, 128)
Male y_val shape: (374, 5)


In [None]:
# Initialize the model
male_speaker_model = M_F_create_speaker_model(vector_length=m_X_train.shape[1], num_speakers=m_y_train_encoded.shape[1])

male_speaker_model.summary()  # summary() prints the table itself and returns None, so no print() is needed


In [None]:
# Train the Male model
male_speaker_model.fit(m_X_train, m_y_train_encoded, validation_data=(m_X_test, m_y_test_encoded), epochs=30, batch_size=32)

Epoch 1/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 5s 111ms/step - accuracy: 0.4677 - loss: 1.3097 - val_accuracy: 0.9144 - val_loss: 0.4009
Epoch 2/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9149 - loss: 0.4052 - val_accuracy: 0.9599 - val_loss: 0.1418
Epoch 3/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9415 - loss: 0.2036 - val_accuracy: 0.9572 - val_loss: 0.1065
Epoch 4/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9492 - loss: 0.1530 - val_accuracy: 0.9572 - val_loss: 0.1066
Epoch 5/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9624 - loss: 0.1135 - val_accuracy: 0.9706 - val_loss: 0.0834
Epoch 6/30
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9626 - loss: 0.1026 - val_accuracy: 0.9652 - val_loss: 0.0697
Epoch 7/30
47/47 ━━━━━━━━

<keras.src.callbacks.history.History at 0x7c0c04306380>

# Save the F/M Models!

- Now we can use the model in streamlit, gradio, or wherever.

In [43]:
# female_speaker_model.save('female_speaker_model.h5')
# male_speaker_model.save('male_speaker_model.h5')

