# AI in Debate Moderation

OK, this is a wild idea! Still, we can agree that debate moderators don't always get it right. Of course not: they are human, and they can't be right 100% of the time. Maybe they manage 80%. Who knows? When the stakes are really high, humans need to get extremely close to 100%, and AI can help them get there.

## Contents
- [The Use Case](#The-Use-Case)
- [Success Criteria](#Success-Criteria)
- [AI Solutions](#AI-Solutions)
    - [Out-of-the-box AI](#Out-of-the-box-AI)
    - [End-to-End AI](#End-to-End-AI)
    - [AI with Feature Engineering](#AI-with-Feature-Engineering)
    - [Transfer Learning](#Transfer-Learning)
- [Thoughts](#Thoughts)

## The Use Case

A debate can be a high-stakes event. Debate moderators must be impartial and able to exert control when debaters get out of control. Impartiality can be achieved by making sure that the debaters are allocated equal time. Debaters who interrupt others can only be controlled by muting them. Managing the mute buttons on top of everything else can be too much for the moderator. Enter AI!


![agent](images/ai_agent.jpg)

The audio feed of the debate moderator and the debate participants is passed through an AI Agent which decides the channel(s) to output. There are lots of technical details to iron out here but we will focus on how AI can achieve this.
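
To make the agent's decision a bit more concrete, here is a minimal sketch of a gating rule. The policy itself is an assumption of mine (the moderator is never muted, and only the debater who currently holds the floor is passed through), and `channels_to_output`, `floor_holder` and `active_speakers` are hypothetical names, not part of any real system.

```python
def channels_to_output(floor_holder, active_speakers, moderator="Wallace"):
    """Return the channels to pass through under a hypothetical muting policy."""
    # Assumed policy: the moderator and the current floor holder may be heard,
    # anyone else who is speaking is treated as interrupting and muted.
    allowed = {moderator, floor_holder}
    return sorted(s for s in active_speakers if s in allowed)


# Trump talks over Biden while Biden holds the floor: only Biden is output.
print(channels_to_output("Biden", ["Biden", "Trump"]))
```

In practice `active_speakers` would come from the speaker classifier discussed below, so the rule is only as good as that model.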

## Success Criteria

1. A metric **talk time** which measures cumulative duration of a debater speaking

     a) Classify speaker based on audio input (Speaker Diarization) with a XX% accuracy
     
     b) Classify interruptions with a XX% accuracy
     
     
2. An output lag of no more than X seconds (Edge Computing will help here but we will not discuss it)

The dynamic control to manage equal allocation of time is left to the moderators and they can take advantage of the metric **talk time** to adjust accordingly. The AI agent can also achieve dynamic control via Reinforcement Learning.
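
The **talk time** metric can be sketched as a running sum over diarized segments. This assumes the diarization step yields `(speaker, start_seconds, end_seconds)` tuples; that format and the numbers below are made up for illustration.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum speaking duration per speaker from (speaker, start_s, end_s) tuples."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


# Made-up segments, in seconds from the start of the debate.
segments = [("Biden", 0.0, 12.5), ("Trump", 12.5, 20.0), ("Biden", 20.0, 24.0)]
print(talk_time(segments))
```

A moderator could glance at these totals to decide who gets the next question.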

## AI Solutions

It all starts with the data, which in this case is a set of labeled audio files. This enables us to build a model that performs speaker diarization. There are advanced models out there, but we want to build our own, and a simple one.

![output](images/output.jpg)

A quick look at the labels tells us this is going to be a multi-label classification problem. Examples 43322 and 43323 show that an audio file can contain multiple speakers. We can assume some of these are interruptions.

If we had exactly one speaker per audio file, this would be multi-class classification.

The difference between the two shows up in the output layer of the model. A multi-label model has one output unit per speaker, each with a sigmoid activation, so several speakers can be flagged at once. A multi-class model uses a softmax activation, which makes the outputs sum to 1 and picks a single speaker.
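
A small numeric sketch of that difference, using made-up output scores for the three speakers. It is plain Python rather than a Keras layer, just to show what the two activations do to the same numbers.

```python
import math


def sigmoid(logits):
    # Each score is squashed independently into (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    # Scores compete with each other and the results sum to 1.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -3.0]  # made-up scores for Biden, Trump, Wallace
multi_label = sigmoid(logits)
multi_class = softmax(logits)

# With sigmoid, two speakers can both be above the 0.5 threshold.
speakers_on = [p > 0.5 for p in multi_label]
print(speakers_on)
# With softmax, only one speaker can win.
print(multi_class.index(max(multi_class)))
```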

### Out-of-the-box AI

This is the go-to for general ML tasks, baselines, or a quick prototype. I can quickly take advantage of [Google Cloud's Automatic Speech Recognition API](https://cloud.google.com/speech-to-text). There are numerous other solutions out there!

I also count AutoML here. You can get a model up and running in an hour! One example is [Google Cloud's AutoML](https://cloud.google.com/automl).


### End-to-End AI

End-to-end AI is difficult to achieve, especially when there is a lot of data and it is unstructured, like video, images, audio or text. Computing resources such as TPUs, GPUs and lots of RAM are required. You can always turn to the cloud provider of your choice, but before you even get there you will need a lot of data. Good data! There is no agreed magic number for the size of the training set, but end-to-end is data hungry. Start with several thousand examples.

![endtoend](images/endtoend.jpg)

Feature engineering, by contrast, requires expert knowledge of the domain and the input data, but once it is done right, simpler models can reach high accuracy.

### AI with Feature Engineering

Feature engineering falls on a spectrum. It can be as easy as playing with a datetime feature or as complicated as audio signal processing. In end-to-end AI, a CNN can learn such features as part of the model. In this example we will look at audio feature engineering.

In [1]:
import IPython.display as ipd
# % pylab inline

import tensorflow as tf
import tensorflow_hub as hub

import os
import pandas as pd
import numpy as np
import librosa
import glob 
import librosa.display
import random
from datetime import datetime

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV


import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from feature_engineering import extract_features

# Keras pieces used by the model cells further down
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import utils

In [None]:
print("TF Version: ", tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

In [None]:
voices = os.listdir('data/debate') 
voices = pd.DataFrame(voices, columns=['audio_file'])
voices['file'] = voices['audio_file'].str[:-4].astype('int64')
voices.head()

In [None]:
labels = pd.read_csv('data/speaker.csv', dtype = {'File':'int32', 'Biden':'int32','Trump':'int32','Wallace':'int32'})


In [None]:
labeled_voices = pd.merge(voices, labels, how='inner', left_on='file',right_on='File')

In [None]:
startTime = datetime.now()
features_label = labeled_voices.apply(extract_features, axis=1)
print(datetime.now() - startTime)

In [None]:
!pip install --upgrade tensorflow-hub

### Transfer Learning

Another approach to getting highly accurate models from small training datasets is transfer learning. We can find models that have already been trained and transfer the weights they learnt into our own model. Here we take advantage of the embeddings provided by [Google VGGish, trained on the YouTube-8M dataset](https://tfhub.dev/google/vggish/1).

In [None]:
import tensorflow_hub as hub

# Load the model.
model = hub.load('https://tfhub.dev/google/vggish/1')

# Input: 3 seconds of silence as mono 16 kHz waveform samples.
waveform = np.zeros(3 * 16000, dtype=np.float32)

# Run the model, check the output.
embeddings = model(waveform)
embeddings.shape.assert_is_compatible_with([None, 128])
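
The model returns a matrix of shape `[None, 128]`, that is, one 128-dimensional embedding per audio frame. To feed a small classifier we need one fixed-length vector per clip. One common option (my assumption here, not something the notebook does) is to average over the frames. The sketch below uses plain lists as a stand-in for the real output; with the actual tensor, `tf.reduce_mean(embeddings, axis=0)` would do the same job.

```python
def mean_pool(frame_embeddings):
    """Average a list of equal-length frame vectors into one clip vector."""
    n_frames = len(frame_embeddings)
    # zip(*rows) walks the matrix column by column.
    return [sum(column) / n_frames for column in zip(*frame_embeddings)]


# Stand-in for a VGGish output with 3 frames of 128 values each.
frames = [[float(i)] * 128 for i in range(3)]
clip_vector = mean_pool(frames)
print(len(clip_vector))
```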

## Thoughts

Whilst the idea is wild and not easily implementable, the exercise represents an approach to AI innovation. Playing with [Kaggle](https://www.kaggle.com/) datasets is a good way to sharpen model-building skills. The key is experimentation. Scaled experimentation! The deliverable is multiple versions of different models.

To achieve this level of scale a lot of technology will need to be involved. MLOps is emerging as a field to manage scaled AI operations. 

In [None]:
def extract_features(files):
    
    # Build the path to the audio file from the row's audio_file column
    # (adjust the folder if the audio lives elsewhere, e.g. under data/debate)
    file_name = os.path.join(os.path.abspath('debate'), str(files.audio_file))

    # Loads the audio file as a floating point time series
    # librosa resamples to 22050 Hz by default
    X, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 

    # 40 Mel-frequency cepstral coefficients (MFCCs), averaged over time
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T,axis=0)

    # Magnitude of the short-time Fourier transform (STFT), reused below
    stft = np.abs(librosa.stft(X))

    # Computes a chromagram from the STFT magnitudes
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)

    # Computes a mel-scaled spectrogram (the audio must be passed as y=)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)

    # Computes spectral contrast
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)

    # Computes the tonal centroid features (tonnetz)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X),
    sr=sample_rate).T,axis=0)
        
    
    # Keep the speaker columns of each file as the label at the end
    label = (files.Biden, files.Trump, files.Wallace)

    return mfccs, chroma, mel, contrast, tonnetz, label
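
It is worth checking how long the concatenated feature vector is, because that number has to match the input size of the network. The sizes below assume the librosa defaults used above (12 chroma bins, 128 mel bands, 6 contrast bands plus one extra row, 6 tonnetz dimensions) and `n_mfcc=40`.

```python
# Length of each time-averaged feature block returned by extract_features.
feature_sizes = {
    "mfcc": 40,      # n_mfcc=40
    "chroma": 12,    # default n_chroma
    "mel": 128,      # default n_mels
    "contrast": 7,   # default n_bands=6, plus one
    "tonnetz": 6,    # six tonal centroid dimensions
}

total = sum(feature_sizes.values())
print(total)  # 193
```

That is where the 193 in the first layer of the model comes from.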

In [None]:
startTime = datetime.now()
features_label = labeled_voices.apply(extract_features, axis=1)
print(datetime.now() - startTime)

In [None]:
features_label

In [None]:
# Concatenate the five feature arrays of each file (the label at index 5 is left out)
features = [np.concatenate(row[:5], axis=0) for row in features_label]

In [None]:
features[0]

In [None]:
X = np.array(features)

In [None]:
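def encode_speakers(biden, trump, wallace):
    # Collapse the three 0/1 columns into one class id: 1 = Biden, 2 = Trump, 3 = Wallace.
    # When several speakers are marked the first match wins, and rows with
    # neither Biden nor Trump fall through to Wallace.
    if biden == 1:
        return 1
    elif trump == 1:
        return 2
    else:
        return 3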

In [None]:
y = labeled_voices.apply(lambda row: encode_speakers(row['Biden'],row['Trump'],row['Wallace']), axis = 1)
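
Note that this encoding turns the problem into multi-class: a file with two speakers gets a single class. If we wanted to keep the multi-label view described earlier, one option is to keep the three columns as a multi-hot target and flag the overlapping files separately. The rows below are made up for illustration, and the sketch uses plain dictionaries instead of the DataFrame.

```python
SPEAKERS = ("Biden", "Trump", "Wallace")

# Made-up rows in the same shape as speaker.csv.
rows = [
    {"File": 1, "Biden": 1, "Trump": 0, "Wallace": 0},
    {"File": 2, "Biden": 1, "Trump": 1, "Wallace": 0},
    {"File": 3, "Biden": 0, "Trump": 0, "Wallace": 1},
]

# Multi-hot targets: one 0/1 entry per speaker, several may be 1.
multi_hot = [[row[s] for s in SPEAKERS] for row in rows]

# Files where more than one speaker is active (possible interruptions).
overlapping = [row["File"] for row in rows if sum(row[s] for s in SPEAKERS) > 1]

print(multi_hot)
print(overlapping)
```

Targets like `multi_hot` would go with a sigmoid output layer and a binary cross-entropy loss rather than softmax.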

In [None]:
#labeled_voices.head()
y.head(20)

In [None]:
y = np.array(y.values)

In [None]:
labelEncoder = LabelEncoder()
y = utils.to_categorical(labelEncoder.fit_transform(y))

In [None]:
X.shape

In [None]:
y.shape

In [None]:
scalar = StandardScaler()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.22, random_state=1) 
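
The two calls are chained, so the second `test_size` applies to what is left after the first split. A quick bit of arithmetic shows the effective shares of the full dataset.

```python
test_frac = 0.1          # first split: 10% held out for testing
val_frac_of_rest = 0.22  # second split: 22% of the remaining 90%

val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = (1 - test_frac) * (1 - val_frac_of_rest)

print(round(train_frac, 3), round(val_frac, 3), test_frac)
```

So roughly 70% of the files end up in training, 20% in validation and 10% in test.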

In [None]:
X_train = scalar.fit_transform(X_train)
X_val = scalar.transform(X_val)
X_test = scalar.transform(X_test)
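
The scaler is fitted on the training set only and then reused for validation and test, so no information leaks from the held-out data. A toy one-feature version with made-up numbers shows the idea; like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

train = [2.0, 4.0, 6.0]  # made-up values of one feature
val = [4.0, 8.0]

# "fit": statistics come from the training data only.
mu, sigma = fmean(train), pstdev(train)

# "transform": the same mu and sigma are applied everywhere.
train_scaled = [(x - mu) / sigma for x in train]
val_scaled = [(x - mu) / sigma for x in val]

print(train_scaled)
print(val_scaled)
```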

In [None]:
model = Sequential()

model.add(Dense(193, input_shape=(193,), activation = 'relu'))
model.add(Dropout(0.1))

model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.25))  

model.add(Dense(8, activation = 'relu'))
model.add(Dropout(0.25))    

model.add(Dense(3, activation = 'softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

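# Note: patience equals the number of epochs used in fit, so this callback can never stop training early; lower it to make early stopping effective
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=1, mode='auto')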
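
To see what `patience` does, here is a simplified model of patience-based stopping. It is a sketch of the idea, not the Keras implementation, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stops after epoch 3.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
print(stopping_epoch(losses, patience=3))
print(stopping_epoch(losses, patience=100))
```

With a patience as large as the number of epochs, the second call never triggers, which is the situation in the cell above.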

In [None]:
history = model.fit(X_train, y_train, batch_size=256, epochs=100, 
                    validation_data=(X_val, y_val),
                    callbacks=[early_stop])

In [None]:
# Check out our train accuracy and validation accuracy over epochs.
import matplotlib.pyplot as plt
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

# Set figure size.
plt.figure(figsize=(12, 8))

# Generate line plot of training and validation accuracy over epochs.
plt.plot(train_accuracy, label='Training Accuracy', color='#185fad')
plt.plot(val_accuracy, label='Validation Accuracy', color='orange')

# Set title
plt.title('Training and Validation Accuracy by Epoch', fontsize = 25)
plt.xlabel('Epoch', fontsize = 18)
plt.ylabel('Accuracy', fontsize = 18)
plt.xticks(range(0,100,5), range(0,100,5))

plt.legend(fontsize = 18);
plt.show()

In [None]:
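# predict_classes was removed from Keras models in TensorFlow 2.6; take the argmax of the class probabilities instead
preds = np.argmax(model.predict(X_test), axis=1)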

In [None]:
preds = labelEncoder.inverse_transform(preds)

In [None]:
y_test

In [None]:
df_test = pd.DataFrame(y_test, columns = ['Biden','Trump','Wallace']) 
df_test['Preds'] = preds

In [None]:
df_test

In [None]:
accurate = df_test[(((df_test['Wallace'] == 1) & (df_test['Preds'] == 3)) | \
        ((df_test['Biden'] == 1) & (df_test['Preds'] == 1)) | \
        ((df_test['Trump'] == 1) & (df_test['Preds'] == 2)))]

In [None]:
round(len(accurate)/len(df_test),3)
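
A single accuracy number can hide a speaker the model rarely gets right. As a complement, here is a small sketch of per-class accuracy. The labels and predictions below are made up, using the same 1 = Biden, 2 = Trump, 3 = Wallace encoding.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Share of correctly predicted examples for each true class."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}


y_true = [1, 1, 2, 2, 3, 3, 3, 3]  # made-up ground truth
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]  # made-up predictions
print(per_class_accuracy(y_true, y_pred))
```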

In [None]:
to_reshape = pd.read_csv('./speaker.csv', dtype = {'File':'int32', 'Biden':'int32','Trump':'int32','Wallace':'int32'})

In [None]:
reshaped_long = pd.melt(to_reshape, id_vars=['File'], value_vars=['Biden', 'Trump', 'Wallace'])
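
For anyone unfamiliar with `pd.melt`, this is what the wide-to-long reshape does, sketched in plain Python on made-up rows. The `variable` and `value` column names are the pandas defaults.

```python
SPEAKERS = ("Biden", "Trump", "Wallace")

# Made-up wide rows: one row per file, one column per speaker.
wide = [
    {"File": 1, "Biden": 1, "Trump": 0, "Wallace": 0},
    {"File": 2, "Biden": 0, "Trump": 1, "Wallace": 1},
]

# Long format: one row per (file, speaker) pair, grouped by speaker.
long_rows = [
    {"File": row["File"], "variable": name, "value": row[name]}
    for name in SPEAKERS
    for row in wide
]

for r in long_rows:
    print(r)
```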

In [None]:
reshaped_long.head()

In [None]:
reshaped_long.to_csv('long_speaker.csv')