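### Defining the hyperparameters

The maximum sequence length, the hop length, and the sampling rate together determine the longest audio clip the model can handle:
$$t_{\max}=\frac{256\times512}{22050}\approx 6\,\text{s}$$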
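The same arithmetic can be checked in a few lines. This is a minimal sketch: the helper name `max_duration_seconds` is hypothetical, and the values simply repeat the hyperparameters of this section.

```python
# Values mirror the hyperparameters defined in this section.
max_seq_length = 256   # number of Mel frames per sample
hop_length = 512       # samples between consecutive frames
sample_rate = 22050    # sampling rate in Hz

def max_duration_seconds(frames, hop, sr):
    """Approximate duration (in seconds) covered by `frames` hops at rate `sr`."""
    return frames * hop / sr

t_max = max_duration_seconds(max_seq_length, hop_length, sample_rate)
print(f"t_max = {t_max:.2f} s")
```

Note that only the hop length enters the estimate. The frame size (`global_n_fft`) affects the frequency resolution, not the number of frames per second.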

In [20]:
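# Model hyperparameters
# noise_dim = 100
num_classes = 8  # number of categories
max_seq_length = 256  # maximum sequence length (number of Mel frames)
hop_length = 512  # hop length between Mel frames, in samples
global_n_fft = 2048  # frame (FFT window) size, in samples
gloabl_sr = 22050  # target sampling rate; audio is resampled to 22050 Hz
feature_dim = 128  # feature dimension (number of Mel bands)
# eos_token = feature_dim + 1  # index of the EOS token
eos_threshold = 0.9  # probability above which a frame counts as EOS

# Training hyperparameters
batch_size = 32
epochs = 1000
sample_interval = 100  # save a checkpoint every this many epochs
n_critic = 0.5  # discriminator's share of the training steps per epoch (updated during training)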

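### Data processing

This time we extract **Mel spectrogram features** with 128 frequency bands. The features are computed as follows:

1. Preprocessing: the audio signal is split into frames, and a window function such as the Hamming window is usually applied to reduce discontinuities at the frame edges.
2. Fourier transform: a fast Fourier transform (FFT) is applied to each frame to obtain its spectrum.
3. Power spectrum: the squared magnitude of the FFT result gives the power spectrum.
4. Mel filter bank: the power spectrum is passed through a bank of Mel filters that are evenly spaced on the Mel frequency scale, which maps the power spectrum onto the Mel scale.
5. Log conversion: to compress the dynamic range and make the data better behaved, the logarithm of each filter output is usually taken.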
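The five steps above can be sketched with NumPy alone. This is an illustrative approximation rather than librosa's implementation: it assumes the HTK-style Mel formula (librosa defaults to the Slaney variant), applies no padding or centering, and every function name below is hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel formula (an assumption; librosa's default differs)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    # 1. Framing with a Hamming window (requires len(y) >= n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hamming(n_fft)
    frames = np.stack([y[i * hop_length: i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    # 2. FFT of every frame
    spectrum = np.fft.rfft(frames, axis=1)
    # 3. Power spectrum
    power = np.abs(spectrum) ** 2
    # 4. Mel filter bank
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # 5. Log conversion to dB, with a small floor to avoid log(0)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))

# Usage: one second of a 440 Hz sine tone
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
features = log_mel_spectrogram(tone, sr=sr)
print(features.shape)  # (frames, n_mels)
```

The output has time on the first axis, matching the `(frames, n_mels)` layout used for the training features in this notebook.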

In [21]:
import os
import librosa
import numpy as np

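def load_audio(file_path):
    # Load the audio file, resampled to the target sampling rate
    y, sr = librosa.load(file_path, sr=gloabl_sr)
    return y, sr

def extract_features(y, sr):
    # Shrink the FFT window for clips shorter than the default frame size
    n_fft = min(global_n_fft, len(y))
    # Extract the Mel power spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=feature_dim, n_fft=n_fft, hop_length=hop_length)
    # Log conversion (currently disabled, so the features stay on a linear power scale)
    # log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
    # Transpose to (frames, n_mels) so that time is the first axis
    return mel_spectrogram.T

def preprocess_data(label_dir):
    # Each sub-directory of label_dir holds one category; its name is the label
    X, y = [], []
    for label in os.listdir(label_dir):
        for file_name in os.listdir(os.path.join(label_dir, label)):
            audio, sr = load_audio(os.path.join(label_dir, label, file_name))
            features = extract_features(audio, sr)
            X.append(features)
            y.append(label)
    return X, y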

X, y = preprocess_data('kick_samples')

cat_dict = {
    "Top":          0,
    "Chest":        1,
    "Signature":    2,
    "Stadium":      3,
    "Punchy":       4,
    "808s":         5,
    "Big":          6,
    "Hardstyle":    7,
}

y = list(map(cat_dict.get, y))
y = np.array(y)
# y_cal = to_categorical(y, num_classes=num_classes)

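To reconstruct audio later, we define a function that converts a normalized Mel spectrogram back into audio.

It works as follows:
1. The model outputs a normalized Mel spectrogram $x_{\text{fake}}$:
   $$x_{\text{fake}}=G(z|\lambda)$$
2. Denormalize it to obtain the scaled Mel spectrogram $s$ (in dB):
   $$s=x_{\text{fake}}\cdot(\max_{\rm dB}-\min_{\rm dB})+\min_{\rm dB}$$
3. Convert from log energy to linear energy:
   $$l=10^{(s/10)}$$
4. Convert from Mel frequency to linear frequency: in practice, the waveform is usually not rebuilt from the Mel spectrogram directly. Instead, the Mel spectrogram is used to estimate a linear-frequency spectrum, and the missing phase is then recovered with the Griffin-Lim algorithm or with a deep-learning-based vocoder such as WaveNet.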
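Steps 2 and 3 can be checked numerically with a small sketch. The function name `denormalize_to_power` is hypothetical, and the defaults assume the same $[-80, 0]$ dB range that the reconstruction code in this section uses.

```python
import numpy as np

def denormalize_to_power(x_fake, min_db=-80.0, max_db=0.0):
    """Map normalized values in [0, 1] to dB, then to linear power."""
    x_fake = np.asarray(x_fake, dtype=float)
    s = x_fake * (max_db - min_db) + min_db   # step 2: denormalize to dB
    l = 10.0 ** (s / 10.0)                    # step 3: dB to linear power
    return s, l

scaled_db, linear_power = denormalize_to_power([0.0, 0.5, 1.0])
print(scaled_db)
print(linear_power)
```

A normalized value of 0.5 maps to $-40$ dB, which corresponds to a linear power of $10^{-4}$. The endpoints 0 and 1 map to $-80$ dB and $0$ dB.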

In [22]:
def convert_to_dB_energy(generated_mels, ref_value=1.0, top_db=80.0):
    # Convert a power-scale output to dB energy
    dB_energy = librosa.power_to_db(generated_mels, ref=ref_value, top_db=top_db)
    return dB_energy

def denormalize_and_convert_to_audio(mel_spectrogram, sr=22050, n_fft=2048, hop_length=512):
    # mel_spectrogram: values normalized to [0, 1], shape (n_mels, frames)
    # Denormalize to the [-80, 0] dB range
    max_db = 0.0
    min_db = -80.0
    scaled_mel = (mel_spectrogram * (max_db - min_db)) + min_db
    
    # Convert from log energy to linear energy
    linear_mel = librosa.db_to_power(scaled_mel)
    
    # Estimate the linear spectrogram and rebuild the phase with Griffin-Lim
    audio_reconstructed = librosa.feature.inverse.mel_to_audio(linear_mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    
    return audio_reconstructed

# Save the audio file (librosa.output.write_wav was removed in librosa 0.8; soundfile can be used instead)
# import soundfile as sf
# sf.write('reconstructed_audio.wav', audio_reconstructed, 22050)

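### Building the model

#### Generator

- **Input**: a noise sequence of shape `(max_seq_length, feature_dim)`, i.e. (256, 128), and a 1-dimensional integer-encoded label.
- Label embedding: the first argument of the Embedding layer is the number of labels, and the second is the dimension of the embedded label vector.
- Label concatenation: the embedded label is repeated along the time axis and concatenated with the noise sequence along the feature axis to form the new input.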
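The label conditioning described above can be traced shape by shape in NumPy. This is only a sketch of the data flow: the random `embedding_table` stands in for the trainable Embedding weights, and the batch size of 4 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, embed_dim = 8, 50
max_seq_length, feature_dim = 256, 128
batch = 4  # arbitrary batch size for the illustration

# Stand-in for the Embedding weights: one 50-dimensional row per label
embedding_table = rng.normal(size=(num_classes, embed_dim))
labels = np.array([0, 3, 5, 7])
noise = rng.normal(size=(batch, max_seq_length, feature_dim))

# Embedding lookup: (batch,) -> (batch, embed_dim)
label_vectors = embedding_table[labels]
# Repeat along the time axis: (batch, max_seq_length, embed_dim)
repeated = np.repeat(label_vectors[:, None, :], max_seq_length, axis=1)
# Concatenate along the feature axis: (batch, max_seq_length, feature_dim + embed_dim)
gen_input = np.concatenate([noise, repeated], axis=-1)
print(gen_input.shape)
```

Every time step of a sample therefore carries the same 50-dimensional label vector next to its own 128 noise features. The table has 8 × 50 = 400 entries, the same as the parameter count reported for `label_embedding` in the model summary.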

In [23]:
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Dense, GRU, Embedding, Concatenate, Reshape, TimeDistributed, Dropout, Flatten, RepeatVector
from keras.optimizers import Adam
from keras.losses import binary_crossentropy, categorical_crossentropy
from keras.utils import to_categorical


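# Label embedding layer shared by the generator and the discriminator
label_embedding_layer = Embedding(num_classes, 50, name='label_embedding')
flatten_layer = Flatten()

# Create the label input (the shape must be a tuple, hence the trailing comma)
label_input = Input(shape=(1,), name='label_input')

# Apply the embedding and flatten the result to (batch, 50)
label_embedding = label_embedding_layer(label_input)
label_embedding = flatten_layer(label_embedding)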

# Generator
def build_generator():
    noise_input = Input(shape=(max_seq_length,feature_dim), name='noise_input')

    # Repeat the label embedding to match the length of the noise sequence
    repeated_label = RepeatVector(max_seq_length)(label_embedding)
    
    # Concatenate the label with the noise sequence along the feature axis
    gen_input = Concatenate(axis=-1)([noise_input, repeated_label])

    # Generate the time series with stacked GRUs
    x = GRU(256, return_sequences=True, name='gru_generator_1')(gen_input)
    x = Dropout(0.3)(x)
    gru_output = GRU(256, return_sequences=True, name='gru_generator_2')(x)
    
    # Output layers: one Mel frame and one EOS probability per time step
    out = TimeDistributed(Dense(feature_dim, activation='relu'), name='fake_output')(gru_output)
    eos_output = TimeDistributed(Dense(1, activation='sigmoid'), name='eos_output')(gru_output)
    # out = Lambda(lambda x: tf.nn.softmax(x, axis=-1))(out)
    
    model = Model(inputs=[noise_input, label_input], outputs=[out, eos_output], name='generator')
    return model

# Discriminator
def build_discriminator():
    mel_input = Input(shape=(None, feature_dim), name='mel_input')
    eos_input = Input(shape=(None, 1), name='eos_input')
    gen_input = Concatenate(axis=-1)([mel_input, eos_input])
    
    # Process the time series with GRUs
    x = GRU(128, return_sequences=True, name='gru_feature')(gen_input)
    x = Dropout(0.3)(x)
    x = GRU(64, name='gru_category')(x)
    
    # Concatenate the GRU output with the embedded label
    x_connected = Concatenate()([x, label_embedding])
    
    # Real/fake decision (conditioned on the label)
    validity = Dense(1, activation='sigmoid', name='validity')(x_connected)
    # Class prediction (from the GRU features only)
    label = Dense(num_classes, activation='softmax', name='label')(x)
    
    model = Model(inputs=[mel_input, label_input, eos_input], outputs=[validity, label], name='discriminator')
    return model

# Instantiate the models
generator = build_generator()
discriminator = build_discriminator()


# Compile the discriminator
discriminator.compile(optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
                      loss=['binary_crossentropy', 'categorical_crossentropy'],
                      metrics=['accuracy'])

# Build the combined AC-GAN
input_noise = Input(shape=(max_seq_length, feature_dim))
input_label = Input(shape=(1,), dtype='int32')
generated_mel, eos_output = generator([input_noise, input_label])
# generated_mel = convert_to_dB_energy(generated_mel)

# Freeze the discriminator while training the generator
discriminator.trainable = False
validity, label_pred = discriminator([generated_mel, input_label, eos_output])

acgan = Model(inputs=[input_noise, input_label], outputs=[validity, label_pred], name='AC-GAN')
# The validity loss ignores y_true and always pushes the prediction towards "real"
acgan.compile(optimizer=Adam(learning_rate=0.002, beta_1=0.5),
              loss=[lambda _, y_pred: binary_crossentropy(tf.ones_like(y_pred), y_pred), 'categorical_crossentropy'],
              metrics=['accuracy'])

# Show the model structure (summary() prints by itself and returns None)
generator.summary()
discriminator.summary()

Model: "generator"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 label_input (InputLayer)       [(None, 1)]          0           []                               
                                                                                                  
 label_embedding (Embedding)    (None, 1, 50)        400         ['label_input[0][0]']            
                                                                                                  
 flatten_3 (Flatten)            (None, 50)           0           ['label_embedding[0][0]']        
                                                                                                  
 noise_input (InputLayer)       [(None, 256, 128)]   0           []                               
                                                                                          

### Model training

In [24]:
# Generate random noise
def generate_noise():
    return np.random.normal(0, 1, (batch_size, max_seq_length, feature_dim))

# Sample a batch of real data
def generate_real_batch():
    idxs = np.random.randint(0, len(X), batch_size)
    mels = [X[idx] for idx in idxs]
    labels = [y[idx] for idx in idxs]
    # EOS target: 0 for every frame except the last one
    eoss = [np.zeros((X[idx].shape[0],1)) for idx in idxs]
    for eos in eoss:
        eos[-1,0] = 1
    return tf.ragged.constant(mels), tf.convert_to_tensor(labels), tf.ragged.constant(eoss)

# Generate a batch of fake data
def generate_fake_batch():
    noise = generate_noise()
    labels = np.random.randint(0, num_classes, batch_size)
    labels = tf.convert_to_tensor(labels)
    generated_mels, generated_eos = generator.predict([noise, labels])
    # Zero out the frames from the first predicted EOS onward
    # (the batch is a dense array, so sequences are masked rather than truncated)
    for i in range(generated_mels.shape[0]):
        eos_indices = np.where(generated_eos[i] >= eos_threshold)[0]
        if len(eos_indices) > 0:
            eos_index = eos_indices[0]
            generated_mels[i, eos_index:] = 0
    return generated_mels, labels, generated_eos

# Train the discriminator
def train_discriminator(real_mels, real_labels, real_eoss, fake_mels, fake_labels, fake_eoss):
    discriminator.trainable = True
    generator.trainable = False

    # Real/fake targets
    valid_y = tf.ones((batch_size, 1))
    fake_y = tf.zeros((batch_size, 1))

    real_labels_categorical = to_categorical(real_labels, num_classes)
    fake_labels_categorical = to_categorical(fake_labels, num_classes)

    # Make sure both one-hot label arrays are tensors
    real_labels_categorical = tf.convert_to_tensor(real_labels_categorical)
    fake_labels_categorical = tf.convert_to_tensor(fake_labels_categorical)

    # One update on the real batch and one on the fake batch
    d_loss_real = discriminator.train_on_batch(
        [real_mels, real_labels, real_eoss],
        [valid_y, real_labels_categorical])
    d_loss_fake = discriminator.train_on_batch(
        [fake_mels, fake_labels, fake_eoss],
        [fake_y, fake_labels_categorical])
    # Average the two sets of losses and metrics
    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
    return d_loss

# Train the generator
def train_generator():
    discriminator.trainable = False
    generator.trainable = True

    noise = generate_noise()
    labels = np.random.randint(0, num_classes, batch_size)
    valid_y = np.ones((batch_size, 1))
    
    # Update the generator through the combined model
    g_loss = acgan.train_on_batch([noise, labels], [valid_y, to_categorical(labels, num_classes)])
    return g_loss

Record the losses during training so that they can be plotted.

In [25]:
# Main training loop
d_losses = []
g_losses = []

for epoch in range(epochs):
    print("########### Training discriminator ###########")
    # At least one discriminator step, so that d_loss is always defined
    n = max(1, int(n_critic*20))
    for _ in range(n):
        # Sample a batch of real data
        real_mels, real_labels, real_eoss = generate_real_batch()
        # Generate a batch of fake data
        fake_mels, fake_labels, fake_eoss = generate_fake_batch()
        # Train the discriminator
        d_loss = train_discriminator(real_mels, real_labels, real_eoss, fake_mels, fake_labels, fake_eoss)
    
    print("Training generator")
    # At least one generator step, so that g_loss is always defined
    for _ in range(max(1, 10-n)):
        g_loss = train_generator()
    
    # Record the losses
    d_losses.append(d_loss)
    g_losses.append(g_loss)
    # Rebalance the schedule: the discriminator's share follows its share of the total loss
    n_critic = d_loss[0]/(d_loss[0]+g_loss[0])
    if n_critic < 0.35:
        n_critic = n_critic**2
    print(n_critic)
    
    # Print progress
    print(f"Epoch {epoch+1}/{epochs} [D loss: {d_loss[0]:.3f} - validity_loss: {d_loss[1]:.3f} - label_loss: {d_loss[2]:.3f}]")
    print(f"Epoch {epoch+1}/{epochs} [G loss: {g_loss[0]:.3f} - validity_loss: {g_loss[1]:.3f} - label_loss: {g_loss[2]:.3f}]")
    
    # Save a checkpoint every sample_interval epochs
    if (epoch + 1) % sample_interval == 0:
        # Save the generator weights
        generator.save_weights('generator_weights_2.h5')
        # Code that generates and saves a few samples can be added here


########### Training discriminator ###########




In [None]:
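%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display

def update_plot(d_losses, g_losses, save_path=None):
    display.clear_output(wait=True)  # replace the previous plot in the cell output
    plt.figure(figsize=(10, 5))
    
    # Each history entry is [total loss, validity loss, label loss, ...]; plot the total loss
    plt.plot([d[0] for d in d_losses], label='D Loss', color='blue')
    plt.plot([g[0] for g in g_losses], label='G Loss', color='red')
    
    # Set the title, axis labels, and legend
    plt.title('Training Losses')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Save before show(), because the figure is empty once it has been displayed
    if save_path is not None:
        plt.savefig(save_path)
    plt.show()  # display the figure

update_plot(d_losses, g_losses, save_path='loss_3.png')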