# A truth-false discriminator for speech stories based on multimodal neural networks

# 1 Author

**Student Name**:  Boshi Li

**Student ID**:  221171442



# 2 Problem formulation

Describe the machine learning problem that you want to solve and explain what's interesting about it.

# 3 Methodology

Describe your methodology. Specifically, describe your training task and validation task, and how model performance is defined (i.e. accuracy, confusion matrix, etc). Any other tasks that might help you build your model should also be described here.

# 4 Implemented ML prediction pipelines

Describe the ML prediction pipelines that you will explore. Clearly identify their input and output, stages and format of the intermediate data structures moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. After providing an overview, describe in more detail each one of the stages that you have included in their corresponding subsections (i.e. 4.1 Transformation stage, 4.2 Model stage, 4.3 Ensemble stage).

## 4.1 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
import numpy as np
import parselmouth
import librosa
from transformers import BertModel, BertTokenizer
import os
import pandas as pd


def extract_f0(audio_file):
    sound = parselmouth.Sound(audio_file)
    pitch = sound.to_pitch()
    f0_values = pitch.selected_array['frequency']
    f0_values = f0_values[f0_values != 0]  # drop unvoiced frames (F0 == 0)
    return np.mean(f0_values)

def extract_formants(audio_file):
    sound = parselmouth.Sound(audio_file)
    formant = sound.to_formant_burg()
    formant_values = []
    for i in range(1, 6):
        values = [formant.get_value_at_time(i, t) for t in formant.ts()]
        values = [v for v in values if not np.isnan(v)]  # drop NaN values
        if values:
            formant_values.append(np.mean(values))
        else:
            formant_values.append(0)  # fall back to 0 when no valid value exists
    return formant_values

def extract_intensity(audio_file):
    sound = parselmouth.Sound(audio_file)
    intensity = sound.to_intensity()
    return np.mean(intensity.values)

def extract_mfcc(audio_file, n_mfcc=13):
    y, sr = librosa.load(audio_file, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def extract_audio_features(audio_file):
    mfcc = extract_mfcc(audio_file)
    f0 = extract_f0(audio_file)
    formant = extract_formants(audio_file)
    intensity = extract_intensity(audio_file)
    
    return {
        'mfcc': mfcc,
        'f0': f0,
        'formant': formant,
        'intensity': intensity
    }




# Extract text features with a pretrained BERT model
class TextFeatureExtractor:
    def __init__(self):
        # Local directory holding the pretrained model
        model_dir = r'D:\books\machine_learning\project\bert-base-uncased'
        
        # Load the tokenizer and model from the local directory
        self.tokenizer = BertTokenizer.from_pretrained(model_dir)
        self.model = BertModel.from_pretrained(model_dir)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)  # move the model to the GPU if one is available
        self.model.eval()  # evaluation mode; BERT is not fine-tuned here
        print(self.device)

    def extract_text_features(self, text):
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
        inputs = {key: value.to(self.device) for key, value in inputs.items()}  # move inputs to the model's device
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()  # mean-pool over tokens, return on CPU

text_extractor = TextFeatureExtractor()

texts = ['hello', 'world']

text_features = np.array([text_extractor.extract_text_features(t) for t in texts])

print(text_features)


data_file=r'D:\books\machine_learning\project\data\CBU0521DD_stories'

df=pd.read_csv(r'D:\books\machine_learning\project\data\CBU0521DD_stories_attributes.csv')


audio_latent = []  # audio feature vectors, one per recording
texts_latent = []  # matching text feature vectors
labels = []  # matching labels: 0 = false, 1 = true

for i in range(1, 101):
    # Build the file names
    file_number = str(i).zfill(5)
    file_name = file_number + ".wav"
    audio_file_path = os.path.join(data_file, f"{file_number}.wav")
    text_file_path = os.path.join(data_file, f"{file_number}.txt")

    audio_latents=extract_audio_features(audio_file_path)
    mfcc = audio_latents['mfcc']
    f0 = audio_latents['f0']
    formant = audio_latents['formant']
    intensity = audio_latents['intensity']
    
    # Coerce each feature to a 1-D array of the expected length before concatenation
    mfcc = np.asarray(mfcc, dtype=np.float64).reshape(13)
    f0 = np.atleast_1d(f0).astype(np.float64)
    formant = np.asarray(formant, dtype=np.float64).reshape(5)
    intensity = np.atleast_1d(intensity).astype(np.float64)


    if i==1:
        print('size of mfcc:',len(audio_latents['mfcc']))
        print(audio_latents['mfcc'])
        print('size of f0:',1)
        print(audio_latents['f0'])
        print('size of formant:',len(audio_latents['formant']))
        print(audio_latents['formant'])
        print('size of intensity:',1)
        print(audio_latents['intensity'])


    concatenated_features = np.concatenate((mfcc, f0, formant, intensity))
    audio_latent.append(concatenated_features)

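    # Read the transcript and embed its contents rather than the path string
    with open(text_file_path, 'r', encoding='utf-8') as f:
        story_text = f.read()
    texts_latent.append(text_extractor.extract_text_features(story_text))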

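    # NOTE: this label is derived from the 'Language' column (English -> 0, otherwise 1);
    # for a true/false discriminator it should come from the story-type attribute instead.
    is_english = (df.loc[df['filename'] == file_name, 'Language'] == 'English').item()
    labels.append(0 if is_english else 1)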


    

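# Standardise each feature to zero mean and unit variance (statistics taken over all 100 samples)
audio_latent = np.asarray(audio_latent)
texts_latent = np.asarray(texts_latent)
audio_latent = (audio_latent - audio_latent.mean(axis=0)) / audio_latent.std(axis=0)
texts_latent = (texts_latent - texts_latent.mean(axis=0)) / texts_latent.std(axis=0)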


audio_features=torch.tensor(audio_latent, dtype=torch.float32)
text_features=torch.tensor(texts_latent, dtype=torch.float32)
labels=torch.tensor(labels, dtype=torch.long)

cuda
[[-0.15500641  0.09710078 -0.02004043 ...  0.01646413 -0.21695724
  -0.09966885]
 [ 0.33452433  0.0429423   0.04421687 ... -0.3241775  -0.12838718
  -0.23485267]]
size of mfcc: 13
[-6.0513245e+02  1.1406797e+02  3.4237747e+01  2.6508791e+01
  1.4176401e+01 -8.3241564e-01 -2.4442281e-01  1.9840821e+00
  3.5147817e+00  1.4265852e+00  8.3669430e-01  5.7564993e+00
  2.3727574e+00]
size of f0: 1
124.44216273197868
size of formant: 5
[801.4843679519009, 1901.0915959809238, 3143.202140521808, 4180.228769356139, 4755.338048883702]
size of intensity: 1
39.237148015087534




In [8]:
print(audio_features.shape)
print(text_features.shape)

torch.Size([100, 20])
torch.Size([100, 768])
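
The transformation cell standardises the features with statistics computed over all 100 samples, before the train/test split performed in Section 4.2. One common alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the held-out rows, so that the test set does not influence the preprocessing. The sketch below illustrates this on random placeholder data; `fit_standardizer`, `apply_standardizer`, the `eps` guard and the 80/20 index split are hypothetical names and choices, not part of the notebook.

```python
import numpy as np

def fit_standardizer(train_rows, eps=1e-8):
    """Return per-feature mean and std estimated from the training rows only."""
    mean = train_rows.mean(axis=0)
    std = train_rows.std(axis=0)
    # eps guards against division by zero for constant features (an added assumption)
    return mean, std + eps

def apply_standardizer(rows, mean, std):
    """Standardise rows with previously fitted statistics."""
    return (rows - mean) / std

rng = np.random.default_rng(0)
# Placeholder standing in for the 100 x 20 audio feature matrix
features = rng.normal(loc=5.0, scale=2.0, size=(100, 20))

# Hypothetical 80/20 split by shuffled index
order = rng.permutation(len(features))
train_idx, test_idx = order[:80], order[80:]

mean, std = fit_standardizer(features[train_idx])
train_z = apply_standardizer(features[train_idx], mean, std)
test_z = apply_standardizer(features[test_idx], mean, std)

print(train_z.shape, test_z.shape)
```

With this arrangement the training features have zero mean and unit variance by construction, while the test features are only approximately standardised, which is the expected behaviour for unseen data.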


## 4.2 Model stage

Describe the ML model(s) that you will build. Explain why you have chosen them.

In [21]:

# MLP that maps one modality's features into a fixed-size embedding
class FeatureMapper(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(FeatureMapper, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )
    
    def forward(self, x):
        return self.fc(x)

# MLP for binary classification
class Classifier(nn.Module):
    def __init__(self, input_dim):
        super(Classifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )
    
    def forward(self, x):
        return self.fc(x)

# Initialise the models
audio_mapper = FeatureMapper(audio_features.shape[1], 64)
text_mapper = FeatureMapper(text_features.shape[1], 64)
classifier = Classifier(128)

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(list(audio_mapper.parameters()) + list(text_mapper.parameters()) + list(classifier.parameters()), lr=0.001)

# Build the dataset and the data loaders
dataset = TensorDataset(audio_features, text_features, labels)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Train the model
for epoch in range(20):
    audio_mapper.train()
    text_mapper.train()
    classifier.train()
    
    for audio, text, label in train_loader:
        optimizer.zero_grad()
        
        audio_mapped = audio_mapper(audio)
        text_mapped = text_mapper(text)
        combined_features = torch.cat((audio_mapped, text_mapped), dim=1)
        
        output = classifier(combined_features)
        loss = criterion(output, label)

        # Add an L2 regularisation term over all trainable parameters
        l2_lambda = 0.01
        l2_norm = sum(p.pow(2.0).sum() for p in list(audio_mapper.parameters()) + list(text_mapper.parameters()) + list(classifier.parameters()))
        loss = loss + l2_lambda * l2_norm
        
        loss.backward()
        optimizer.step()
    
    print(f'Epoch {epoch+1}, Training Loss: {loss.item()}')

    # Evaluate the model on the test set
    audio_mapper.eval()
    text_mapper.eval()
    classifier.eval()
    
    correct = 0
    total = 0
    with torch.no_grad():
        for audio, text, label in test_loader:
            audio_mapped = audio_mapper(audio)
            text_mapped = text_mapper(text)
            combined_features = torch.cat((audio_mapped, text_mapped), dim=1)
            
            output = classifier(combined_features)
            _, predicted = torch.max(output.data, 1)
            total += label.size(0)
            correct += (predicted == label).sum().item()
    
    accuracy = 100 * correct / total
    print(f'Epoch {epoch+1}, Test Accuracy: {accuracy:.2f}%')


    

    



Epoch 1, Training Loss: 2.1478028297424316
Epoch 1, Test Accuracy: 45.00%
Epoch 2, Training Loss: 1.6925914287567139
Epoch 2, Test Accuracy: 60.00%
Epoch 3, Training Loss: 1.3941954374313354
Epoch 3, Test Accuracy: 70.00%
Epoch 4, Training Loss: 1.2448742389678955
Epoch 4, Test Accuracy: 75.00%
Epoch 5, Training Loss: 0.9970323443412781
Epoch 5, Test Accuracy: 70.00%
Epoch 6, Training Loss: 0.7129600048065186
Epoch 6, Test Accuracy: 65.00%
Epoch 7, Training Loss: 0.6703973412513733
Epoch 7, Test Accuracy: 75.00%
Epoch 8, Training Loss: 0.5839830636978149
Epoch 8, Test Accuracy: 70.00%
Epoch 9, Training Loss: 0.6772456765174866
Epoch 9, Test Accuracy: 70.00%
Epoch 10, Training Loss: 0.4170570373535156
Epoch 10, Test Accuracy: 70.00%
Epoch 11, Training Loss: 0.4402984380722046
Epoch 11, Test Accuracy: 80.00%
Epoch 12, Training Loss: 0.47722959518432617
Epoch 12, Test Accuracy: 70.00%
Epoch 13, Training Loss: 0.36313578486442566
Epoch 13, Test Accuracy: 65.00%
Epoch 14, Training Loss: 0.3
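
The loop above reports only accuracy on the 20 held-out samples. Since the methodology section also names the confusion matrix as a performance measure, a minimal sketch of computing one for the two classes is given here. The label lists are made-up placeholders rather than outputs of the model, and `confusion_matrix_2x2` is a hypothetical helper.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1 (rows: true, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Placeholder labels for illustration only (not results from the notebook)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(cm)                             # [[3, 1], [2, 4]]
print(f"accuracy = {accuracy:.2f}")   # accuracy = 0.70
```

In the notebook, `y_true` and `y_pred` would be collected from `label` and `predicted` across the test batches, for example by extending two Python lists with `label.tolist()` and `predicted.tolist()` inside the evaluation loop.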

## 4.3 Ensemble stage

Describe any ensemble approach you might have included. Explain why you have chosen them.

# 5 Dataset

Describe the datasets that you will create to build and evaluate your models. Your datasets need to be based on our MLEnd Deception Dataset. After describing the datasets, build them here. You can explore and visualise the datasets here as well. 

If you are building separate training and validation datasets, do it here. Explain clearly how you are building such datasets, how you are ensuring that they serve their purpose (i.e. they are independent and consist of IID samples) and any limitations you might think of. It is always important to identify any limitations as early as possible. The scope and validity of your conclusions will depend on your ability to understand the limitations of your approach.

If you are exploring different datasets, create different subsections for each dataset and give them a name (e.g. 5.1 Dataset A, 5.2 Dataset B, 5.3 Dataset C).



# 6 Experiments and results

Carry out your experiments here. Analyse and explain your results. Unexplained results are worthless.

# 7 Conclusions

Your conclusions, suggestions for improvements, etc should go here.

# 8 References

Acknowledge others here (books, papers, repositories, libraries, tools) 