#Speech Censor Bot:


---


A simple bot for censoring audio-visual clips. The notebook is inspired by the OpenVINO v2020.1 docs for the [Offline Speech Recognition Demo](https://docs.openvinotoolkit.org/latest/_inference_engine_samples_speech_libs_and_demos_Offline_speech_recognition_demo.html) and by [another page](https://docs.openvinotoolkit.org/latest/_inference_engine_samples_speech_libs_and_demos_Speech_libs_and_demos.html) of the documentation on inference engine samples. 
The first link explains how custom *KALDI* models can be used in the application. For demonstration purposes, we use the pre-trained model shipped with the OpenVINO package itself.

Users can test/use the notebook to run custom files by editing the **TODO** sections in the code cells. 

**Note: The OpenVINO version matters, as different versions have been observed to provide slightly different information.**

To understand the KALDI files, see [this explanation](https://stackoverflow.com/questions/54428601/kaldis-objects-explained-in-laymans-term).

#Section 1 : For Audio + Video

##Install packages and dependencies

In [1]:
#For audio preprocessing and audio-video manipulation
import wave
!apt-get install libsox-fmt-all libsox-dev sox
!pip install ffmpeg

Cloning into '/content/gentle'...
remote: Enumerating objects: 1634, done.

##Install OpenVINO toolkit and dependencies 
We use a portable version of the OpenVINO toolkit for Linux-based systems, maintained and open-sourced by my friend [Muhammad Ali](https://github.com/alihussainia/OpenDevLibrary). See his repository for more insight into the cell below.

In [2]:
!wget "https://raw.githubusercontent.com/alihussainia/OpenDevLibrary/master/openvino_initialization_script.py"
!python openvino_initialization_script.py

--2020-02-25 11:18:50--  https://raw.githubusercontent.com/alihussainia/OpenDevLibrary/master/openvino_initialization_script.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3232 (3.2K) [text/plain]
Saving to: ‘openvino_initialization_script.py’


2020-02-25 11:18:50 (68.3 MB/s) - ‘openvino_initialization_script.py’ saved [3232/3232]

--2020-02-25 11:18:52--  https://storage.googleapis.com/open_vino_public/l_openvino_toolkit_p_2020.1.023.tgz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.31.128, 2607:f8b0:400c:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.31.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 508213676 (485M) [application/x-compressed]
Saving to: ‘l_openvino_toolkit_p

## Initialize the OpenVINO environment and download related files
The bash script will
- Download the pre-trained Intel models
- Create the configuration file (needed for running inference on speech)
- Set up the prerequisites for the LibriSpeech model (graph file, etc.)
- Run the **Online** and **Offline** demos to validate that all prerequisites were installed properly. 

**Note: On Google Colab, the online demo may not work, so the execution may appear to hang indefinitely. To avoid this, replace the "demo_speech_recognition.sh" file with a custom one.**

To work around this issue, un-comment the commands below. They replace the original file with a modified one maintained in the repository.


In [3]:
#!wget -P "/content/" "https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/Demo_Backup_Files/demo_speech_recognition.sh"
#!rm "/opt/intel/openvino_2020.1.023/deployment_tools/demo/demo_speech_recognition.sh" 
#!cp -f "/content/demo_speech_recognition.sh" "/opt/intel/openvino_2020.1.023/deployment_tools/demo/demo_speech_recognition.sh" 

--2020-02-25 11:43:56--  https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/Demo_Backup_Files/demo_speech_recognition.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5672 (5.5K) [text/plain]
Saving to: ‘/content/demo_speech_recognition.sh’


2020-02-25 11:43:56 (49.0 MB/s) - ‘/content/demo_speech_recognition.sh’ saved [5672/5672]



In [16]:
!bash /opt/intel/openvino_2020.1.023/deployment_tools/demo/demo_speech_recognition.sh
# Note: Simply running the bash script using "!" gives permission issues. The issue is being tackled.

/bin/bash: /opt/intel/openvino_2020.1.023/deployment_tools/demo/demo_speech_recognition.sh: Permission denied


##Running Offline DEMO 
*(To use it with a custom WAV file, edit the "run_demo.sh" file and add the path to your file.)*

The output generated by *run_demo.sh* looks similar to:


>[ INFO ] Using feature transformation /root/openvino_models/ir/intel/lspeech_s5_ext/FP32/lspeech_s5_ext.feature_transform        
[ INFO ] InferenceEngine API ver. 2.1 (build: 37988)        
[ INFO ] Device info:        
[ INFO ] 	CPU: MKLDNNPlugin ver. 2.1        
[ INFO ] Batch size: 8        
[ INFO ] Model loading time: 49.93 ms        
Recognition result:        
**HOW ARE YOU DOING**

We only need the text recognized from the speech, so we extract it (in a naive way) by filtering the console output with *sed*.

Next, we save this output to a text file for the next step: forced alignment.
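
The filtering step can also be sketched in plain Python. This is only an illustration of the idea of dropping everything up to and including the "Recognition result" line; the helper name `extract_transcript` and the shortened sample log are assumptions made for this example and are not part of the demo scripts.

```python
def extract_transcript(console_output, marker="Recognition result"):
    """Return the text that follows the first line containing `marker`."""
    lines = console_output.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            # Keep everything after the marker line, dropping empty lines
            kept = [text.strip() for text in lines[index + 1:] if text.strip()]
            return "\n".join(kept)
    return ""  # Marker never appeared, e.g. the demo stopped before inference


# Shortened, illustrative console log (not a full demo log)
sample_log = """[ INFO ] Batch size: 8
[ INFO ] Model loading time: 49.93 ms
Recognition result:
HOW ARE YOU DOING
"""

print(extract_transcript(sample_log))
```

Returning an empty string when the marker is missing makes it easy to notice a failed run before moving on to forced alignment.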



In [12]:
import os
# TODO:
#Add the path to your WAV file
!wget "https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/negan.wav"
wav_path = "/content/negan.wav"

--2020-02-25 12:08:40--  https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/negan.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8998990 (8.6M) [application/octet-stream]
Saving to: ‘negan.wav’


2020-02-25 12:08:41 (71.1 MB/s) - ‘negan.wav’ saved [8998990/8998990]

cuss.wav


##Preprocessing audio file
As per the OpenVINO v2020.1 docs [here](https://docs.openvinotoolkit.org/latest/_inference_engine_samples_speech_libs_and_demos_Offline_speech_recognition_demo.html), the WAV file needs to be in the following format: RIFF WAVE PCM 16-bit, 16 kHz, 1 channel, i.e.,

>Sample size : 16bit    
Sampling Rate : 16kHz    
Number of channels : 1        

We check the audio, convert it if needed, and replace the old file with the converted one.

In [13]:
def preprocess(org_aud_path):
  #Read the header once and close the file before sox rewrites it
  with wave.open(org_aud_path, 'r') as tx:
    n_channels = tx.getnchannels()
    frame_rate = tx.getframerate()
  print ("Initial Parameters:")
  !sox --i "$org_aud_path"
  if(n_channels > 1):
    #Convert stereo to mono
    #and replace the original with new
    !sox "$org_aud_path" processed.wav channels 1
    !rm -r "$org_aud_path"
    !mv "processed.wav" "$org_aud_path"
    print("Converted Stereo to Mono")

  if(frame_rate != 16000):
    #Downsample (if > 16k) and Upsample (if < 16k)
    #and replace the original with new
    !sox "$org_aud_path" processed.wav rate 16000
    !rm -r "$org_aud_path"
    !mv "processed.wav" "$org_aud_path"
    print("Changed sample rate to 16k")

    #processed.wav is only a temporary name; the converted audio ends up at the original path
    print("Processed file into the same path with name 'processed.wav' ")

preprocess(wav_path)
print("Update file parameters")
!sox --i "$wav_path"

Initial Parameters:

Input File     : '/content/cuss.wav'
Channels       : 2
Sample Rate    : 48000
Precision      : 16-bit
Duration       : 00:00:30.02 = 1440878 samples ~ 2251.37 CDDA sectors
File Size      : 5.76M
Bit Rate       : 1.54M
Sample Encoding: 16-bit Signed Integer PCM

Converted Stereo to Mono
Changed sample rate to 16k
Processed file into the same path with name 'processed.wav' 
Update file parameters

Input File     : '/content/cuss.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:30.02 = 480293 samples ~ 2251.37 CDDA sectors
File Size      : 961k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
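
The same format requirement (1 channel, 16 kHz, 16-bit) can be checked without sox by using Python's standard `wave` module. The sketch below is an illustration only: the helper name `wav_format_issues` and the temporary demo file are assumptions made for this example, and the check reads the header without converting anything.

```python
import os
import tempfile
import wave


def wav_format_issues(path, channels=1, sample_rate=16000, sample_width_bytes=2):
    """List the ways a WAV file deviates from the expected PCM layout."""
    issues = []
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != channels:
            issues.append("channels: %d" % wav_file.getnchannels())
        if wav_file.getframerate() != sample_rate:
            issues.append("sample rate: %d" % wav_file.getframerate())
        if wav_file.getsampwidth() != sample_width_bytes:
            issues.append("sample width: %d bytes" % wav_file.getsampwidth())
    return issues


# Write a short silent stereo 48 kHz file so the check has something to inspect
demo_path = os.path.join(tempfile.mkdtemp(), "demo_stereo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(2)
    wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav_file.setframerate(48000)
    wav_file.writeframes(b"\x00" * 4 * 480)  # 480 frames of silence

print(wav_format_issues(demo_path))
```

An empty list means the file already matches the expected layout and needs no conversion.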



The demo uses the "how_are_you_doing.wav" audio file stored at  
```/opt/intel/openvino/deployment_tools/demo/how_are_you_doing.wav```
This file is fed to the inference engine by the bash file ```run_demo.sh```. Instead of editing that bash file or creating a new one, we rename our WAV file to **how_are_you_doing.wav** and replace the original file with ours.

In [14]:
%cd "/content/"

#Rename file here OR edit the bash file
!mv "$wav_path" "how_are_you_doing.wav"

#Replace the file for test on custom file by removing it first 
!rm -r "/opt/intel/openvino/deployment_tools/demo/how_are_you_doing.wav"
!cp "/content/how_are_you_doing.wav" "/opt/intel/openvino/deployment_tools/demo/"


/content


##Perform Inference
The OpenVINO dependencies have been installed and the environment has been initialized. It's time to run inference! As the shell script echoes the result to the terminal, we pipe its output through ```sed``` to write the result to a text file. The same command is first run without the pipe to show the status of the inference.

In [15]:
!/opt/intel/openvino/data_processing/audio/speech_recognition/demos/offline_speech_recognition_demo/run_demo.sh 
#Running again to save the output
!/opt/intel/openvino/data_processing/audio/speech_recognition/demos/offline_speech_recognition_demo/run_demo.sh | sed '1,/Recognition result/d' > /content/out_text.txt 


[setupvars.sh] OpenVINO environment initialized
target_precision = FP32
Using model configuration file /root/openvino_models/ir/intel/lspeech_s5_ext/FP32/speech_lib.cfg
Run Inference Engine offline speech recognition demo

Run ./offline_speech_recognition_app -wave=/opt/intel/openvino/data_processing/audio/speech_recognition/demos/offline_speech_recognition_demo/../../../../../deployment_tools/demo/how_are_you_doing.wav -c=/root/openvino_models/ir/intel/lspeech_s5_ext/FP32/speech_lib.cfg

[ INFO ] Using feature transformation /root/openvino_models/ir/intel/lspeech_s5_ext/FP32/lspeech_s5_ext.feature_transform
[ INFO ] InferenceEngine API ver. 2.1 (build: 37988)
[ INFO ] Device info:
[ INFO ] 	CPU: MKLDNNPlugin ver. 2.1
[ INFO ] Batch size: 8
[ INFO ] Model loading time: 63.46 ms
Recognition result:
OR TO STAMP ACT OR <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> TIME



##Important filepaths
####WAV file path: 
**"/opt/intel/openvino/deployment_tools/demo/how_are_you_doing.wav"**

####Configuration file path: 
**"/root/openvino_models/ir/intel/lspeech_s5_ext/FP32/speech_lib.cfg"**

##Use **gentle** to obtain syncmap
[Gentle](https://github.com/lowerquality/gentle) is a forced aligner built on KALDI that automatically generates a synchronization map between a list of text fragments and an audio file containing the narration of the text. In computer science, this task is known as forced alignment.

Due to some difficulties, we obtain the syncmap files (a JSON and a CSV file) manually from the [site](https://lowerquality.com/gentle/).
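
To make the shape of the syncmap concrete, here is a minimal sketch of reading word timings from an alignment JSON. The sample data is invented for illustration. The keys `words`, `word`, `start`, `end`, and the `not-found-in-audio` case are the ones this notebook relies on, while the `"success"` value and the helper name `aligned_words` are assumptions.

```python
import json

# Invented sample in the layout this notebook expects from the aligner
sample_alignment = json.loads("""
{
  "words": [
    {"word": "how",   "case": "success", "start": 0.42, "end": 0.61},
    {"word": "are",   "case": "not-found-in-audio"},
    {"word": "you",   "case": "success", "start": 0.80, "end": 0.95},
    {"word": "doing", "case": "success", "start": 0.95, "end": 1.40}
  ]
}
""")


def aligned_words(alignment):
    """Return (word, start, end) tuples for the words that carry timestamps."""
    rows = []
    for entry in alignment.get("words", []):
        # Words that could not be aligned have no 'start'/'end' keys
        if "start" in entry and "end" in entry:
            rows.append((entry["word"], entry["start"], entry["end"]))
    return rows


for word, start, end in aligned_words(sample_alignment):
    print(word, start, end)
```

Checking for the timestamp keys up front avoids a `KeyError` on words the aligner could not locate in the audio.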

In [0]:
#TODO: Add path to your json file
json_path = "/content/align.json"

##Time to censor 


1.   Get the list of profane words from [here](https://raw.githubusercontent.com/PrashantDandriyal/Google-profanity-words/master/list.txt)
2.   Detect any occurrence of these words in the aligned transcript
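
The two steps above can be sketched as a small, self-contained matcher. The banned words and the transcript here are harmless placeholders rather than entries from the real list, and the helper names `normalize` and `find_profane` are hypothetical.

```python
import string


def normalize(word):
    """Lower-case a token and strip surrounding punctuation."""
    return word.strip(string.punctuation).lower()


def find_profane(words, banned):
    """Return the tokens from `words` whose normalized form is banned."""
    # A set gives constant-time membership tests for long word lists
    banned_set = {normalize(entry) for entry in banned if entry.strip()}
    return [word for word in words if normalize(word) in banned_set]


banned_words = ["darn", "heck"]              # placeholder list
transcript = "Well, DARN it. What the heck?"  # placeholder transcript

print(find_profane(transcript.split(), banned_words))
```

Normalizing both sides matters here because the speech recognizer prints upper-case text, while a word list like the one above is compared in lower case by the cells below.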



In [87]:
%cd /content/
!wget "https://raw.githubusercontent.com/PrashantDandriyal/Google-profanity-words/master/list.txt"
profane_text = "/content/list.txt"

/content
--2020-02-25 10:54:50--  https://raw.githubusercontent.com/PrashantDandriyal/Google-profanity-words/master/list.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3621 (3.5K) [text/plain]
Saving to: ‘list.txt.1’


2020-02-25 10:54:50 (83.9 MB/s) - ‘list.txt.1’ saved [3621/3621]



In [0]:
#Make a list out of all the words in the profane_text txt file. 
#This avoids searching the file repeatedly 
with open(profane_text) as f:
  cuss_list = [i.strip().lower() for i in f if i.strip()]

#TODO: Add any word that should have been detected as profane but was transcribed differently
#The word "F*Ck" has been recognised as "For", so we censor "for" in the WAV file "negan.wav" 
cuss_list.append("for")

In [0]:
#Parse the JSON content into an easily accessible table
import json
import pandas as pd

with open(json_path, 'r') as f:
    handle = json.load(f)

rows_list = []

for i in handle['words']:
  if((i['word']).lower() in cuss_list):
    try:
      rows_list.append({"word":i['word'], 
                "start":i['start'],
                "end":i['end']})
    except KeyError:    #Words that were not aligned have no timestamps; their entry has "'case': 'not-found-in-audio'"
      pass

#Passing the column names keeps the table well-formed even if no word matched
df = pd.DataFrame(rows_list, columns=['word', 'start', 'end'])
print(df.head())

In [0]:
#Traverse the entire transcript (text converted from speech) and look for any cuss word
#If one occurs, save its start and end times (timestamps) from the alignment table df
text_path = "/content/out_text.txt"   #Transcript written by the inference cell above

swear_dict_list = []

with open(text_path, 'r') as f:
    for line in f:  
      for word in line.split():
        if(word.lower() in cuss_list and not df.empty):    #Compare in lower case, as in the alignment cell above
          #df has a numeric index, so select rows by matching the 'word' column
          for _, row in df[df['word'].str.lower() == word.lower()].iterrows():
            swear_dict_list.append({"swear":word, 
                "start":row['start'],
                "end":row['end']})
dff = pd.DataFrame(swear_dict_list, columns=['swear', 'start', 'end']).drop_duplicates()

Now that we have the time spans of all the profane words in the speech, we suppress them by muting/fading the corresponding slices of audio.

In [91]:
wav_path = "/opt/intel/openvino/deployment_tools/demo/how_are_you_doing.wav"
#Format of SOX command to fade in. Refer to(https://stackoverflow.com/questions/20127095/using-sox-to-change-the-volume-level-of-a-range-of-time-in-an-audio-file)
'''
fade [type] fade-in-length [stop-position(=) [fade-out-length]]

sox -m
    -t wav "|sox -V1 inputfile.wav -t wav - fade t 0 2.2 0.4" 
    -t wav "|sox -V1 inputfile.wav -t wav - trim 1.8 fade t 0.4 3.4 0.4 gain -6 pad 1.8"
    -t wav "|sox -V1 inputfile.wav -t wav - trim 4.8 fade t 0.4 0 0 pad 4.8"
    outputfile.wav gain 9.542
'''
for i in df.index: 
  start = df['start'][i]
  end = df['end'][i]
  duration = end-start
  #Length (in seconds) of the transition over which the audio fades out and back in
  trans = 0.3
  fade_start = start-(trans/2)
  act_fade_start = start+(trans/2)
  #Timeline: fade_start -> act_fade_start -> fade_end -> act_fade_end
  fade_end = end-(trans/2)
  act_fade_end = end+(trans/2)

  fade_duration = duration + trans
  #Note: Don't forget to add the $ before variable names in the sox command
  print("For ",start,", ", end)
  !sox -m -t wav "|sox -V1 $wav_path -t wav - fade t 0 $act_fade_start $trans" -t wav "|sox -V1 $wav_path -t wav - trim $fade_start fade t $trans $fade_duration $trans gain -40 pad $fade_start" -t wav "|sox -V1 $wav_path -t wav - trim $fade_end fade t $trans 0 0 pad $fade_end" outputfile.wav gain 9.542
  #Replacing the input file with the outputfile
  !rm -r "$wav_path"
  !cp outputfile.wav "$wav_path"
  
  

For  19.5 ,  19.76
sox FAIL fade: fade-out overlaps fade-in
For  22.85 ,  23.220000000000002
For  29.679999000000002 ,  29.699999000000002
sox FAIL fade: fade-out overlaps fade-in
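
The two "fade-out overlaps fade-in" failures above line up with the words shorter than the 0.3 s transition (0.26 s and 0.02 s), while the 0.37 s word went through. This suggests that the middle segment's fade-in plus fade-out must fit inside `fade_duration`, i.e. that `trans` must not exceed the word's duration. The sketch below shows one possible way to compute the same quantities with the transition shortened for brief words. It is an assumption-based illustration (the helper name `fade_params` is hypothetical), not a verified fix for every input.

```python
def fade_params(start, end, trans=0.3):
    """Compute the fade timings used by the sox command, with a clamped transition."""
    duration = end - start
    # Shorten the transition for brief words so that
    # fade-in + fade-out (2 * trans) fits inside duration + trans
    safe_trans = min(trans, duration)
    return {
        "trans": safe_trans,
        "fade_start": start - safe_trans / 2,
        "act_fade_start": start + safe_trans / 2,
        "fade_end": end - safe_trans / 2,
        "act_fade_end": end + safe_trans / 2,
        "fade_duration": duration + safe_trans,
    }


# The three word spans printed by the cell above
for word_start, word_end in [(19.5, 19.76), (22.85, 23.22), (29.679999, 29.699999)]:
    params = fade_params(word_start, word_end)
    print(word_start, word_end, round(params["trans"], 3), round(params["fade_duration"], 3))
```

A very short transition produces an abrupt mute, so another option would be to widen the muted span around short words instead. Which of the two sounds better is a judgment call.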


In [0]:
#Copy the final censored audio file to working directory
!cp "$wav_path" "/content/final_audio.wav"

               


        
---
The above section marks the end of the audio censoring. To generate video output, get the link to the video file and proceed below.


#Section 2 : Only for Video
Replace the audio track of the video with the censored audio using ffmpeg.
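
As a sketch of what the ffmpeg call in this section does, the helper below builds the same argument list in Python: `-map 0:v:0` takes the first video stream of the first input, `-map 1:a:0` takes the first audio stream of the second input, and `-c:v copy` keeps the video stream without re-encoding it. The function name `build_mux_command` is hypothetical, and the paths are the ones used in this notebook.

```python
import shlex


def build_mux_command(video_path, audio_path, output_path):
    """Build an ffmpeg argument list that swaps a video's audio track."""
    return [
        "ffmpeg",
        "-i", video_path,    # input 0: original video
        "-i", audio_path,    # input 1: censored audio
        "-c:v", "copy",      # copy the video stream as-is
        "-map", "0:v:0",     # video from input 0
        "-map", "1:a:0",     # audio from input 1
        output_path,
    ]


command = build_mux_command(
    "/content/negan_clip_original.mp4",
    "/content/final_audio.wav",
    "censored_vid.mp4",
)
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(part) for part in command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed. Building it as a list avoids shell-quoting problems with paths that contain spaces.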

In [93]:
### TODO : Get the video file
!wget -P "/content/" "https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/DocsResources/negan_clip_original.mp4"
vid_path = "/content/negan_clip_original.mp4"

--2020-02-25 10:55:26--  https://raw.githubusercontent.com/PrashantDandriyal/Speech-Censor-Bot/master/DocsResources/negan_clip_original.mp4
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 997258 (974K) [application/octet-stream]
Saving to: ‘/content/negan_clip_original.mp4.1’


2020-02-25 10:55:27 (13.2 MB/s) - ‘/content/negan_clip_original.mp4.1’ saved [997258/997258]



In [95]:
!ffmpeg -i "$vid_path" -i "$wav_path" -c:v copy -map 0:v:0 -map 1:a:0 censored_vid.mp4
#The output is generated in the working directory with the name "censored_vid.mp4"

ffmpeg version 3.4.6-0ubuntu0.18.04.1 Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.3.0-16ubuntu3)
  configuration: --prefix=/usr --extra-version=0ubuntu0.18.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --ena