# Talking To AI-Generated People | Fake Faces, Script, Voice and Lip-Sync Animation
## I combined several state-of-the-art image and speech generation neural networks into a single Google Colab notebook that generates a talking-head video of a random fake person replying to a text prompt.

### To run, connect to a GPU instance from the menu Runtime -> Change runtime type, then press Run All under the Runtime menu. The text prompt will appear at the bottom of this page (the first run may take up to 10 minutes for setup and installation).


#### Credits for the tools and repositories used:
1) Face Generation - www.thispersondoesnotexist.com - Nvidia StyleGAN2
2) Text Generation - www.textsynth.org - OpenAI GPT-2
3) Text-to-Speech Conversion - https://github.com/NVIDIA/flowtron - Flowtron
4) Lip Animation - https://github.com/Rudrabha/LipGAN - LipGAN


#### TODO improvements (any volunteers?):
1) Use a motion model to animate the face before performing lip-sync.
2) Use the newer GPT-3 model for better, more coherent text responses.


# Step 1: Get an image of a fake person from This-Person-Does-Not-Exist
Install selenium and chromium webdriver dependencies

In [None]:
!rm -r sample_data
!pip install selenium
!apt-get update # refresh the package index so that apt install can find chromium-chromedriver
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)

Download a fake person's face from https://thispersondoesnotexist.com/ using the following code. If you want a different face, rerun this code cell until you get one you like.

Note that the current speech generation model only outputs a female voice, so you may want to pick a face that matches it.

In [None]:
from selenium.webdriver.common.action_chains import ActionChains
driver.get("https://thispersondoesnotexist.com/")
import time 
time.sleep(5)
button = driver.find_element_by_id('saveButton')
ActionChains(driver).move_to_element(button).click(button).perform()
time.sleep(4)
from IPython.display import Image
Image('person.jpg')
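
The fixed `time.sleep(4)` above is a guess at how long the download takes. A minimal sketch of a polling alternative follows, assuming the browser saves `person.jpg` into the current working directory as the cell above expects. `wait_for_file` is a hypothetical helper, not part of Selenium, and its timeout values are illustrative.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists with a stable, non-zero size.

    Hypothetical helper: polls the file system instead of sleeping
    for a fixed number of seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Same non-zero size on two consecutive polls:
            # assume the browser has finished writing the file.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

In the cell above, `wait_for_file('person.jpg')` could stand in for `time.sleep(4)`, with a `False` return value signalling that the cell should be rerun.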

# Step 2: Generate response script with Text Synth

In [None]:
# prompt = input("Ask this person a question: ")
prompt = 'Hi there, do you know what the time is?'

from selenium.webdriver.common.keys import Keys

driver.get("http://textsynth.org/")
driver.implicitly_wait(10)
inputElement = driver.find_element_by_id('input_text')
inputElement.click()
inputElement.clear()
inputElement.send_keys(prompt)
button = driver.find_element_by_id('submit_button')
ActionChains(driver).move_to_element(button).click(button).perform()
time.sleep(10)
responseElement = driver.find_element_by_id('gtext')
response = responseElement.text
response = response[len(prompt):].replace('\n', ' ')
print(response)
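
The slicing above assumes the generated text starts with the prompt, and the generated text can stop mid-sentence, which sounds odd once it is spoken. Here is a small sketch of a stricter clean-up step. `extract_reply` and its `max_sentences` parameter are hypothetical names introduced for illustration, and the sentence splitting is a simple punctuation heuristic, not a proper tokenizer.

```python
import re


def extract_reply(generated, prompt, max_sentences=3):
    """Strip the echoed prompt and keep at most `max_sentences` complete sentences."""
    # Only slice the prompt off if the page really echoed it back.
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join(text.split())
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Drop a trailing fragment that does not end in sentence punctuation.
    complete = [s for s in sentences if s and s[-1] in ".!?"]
    if not complete:
        return text  # nothing looks like a full sentence; keep the raw text
    return " ".join(complete[:max_sentences])
```

With this helper, `response = extract_reply(responseElement.text, prompt)` would replace the two `response = ...` lines above.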

# Step 3: Convert response text to speech with Flowtron
First, clone the Flowtron repository and install the requirements (this may take up to 3-4 minutes).

In [None]:
!git clone https://github.com/NVIDIA/flowtron.git
%cd flowtron
!git submodule update --init
%cd tacotron2
!git submodule update --init
%cd ..

In [None]:
!pip install virtualenv
!virtualenv flowtronenv

In [None]:
!source flowtronenv/bin/activate; pip install numpy==1.16.4 inflect==0.2.5 librosa==0.6.0 scipy==1.0.0 tensorboardX==1.1 Unidecode==1.0.22 pillow matplotlib numba==0.48; pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Download Pre-Trained Models

In [None]:
!wget -N  -q https://raw.githubusercontent.com/yhgon/colab_utils/master/gfile.py
!mkdir models
!python gfile.py -u 'https://drive.google.com/open?id=1KhJcPawFgmfvwV7tQAOeC253rYstLrs8' -f 'models/flowtron_libritts.pt'
!python gfile.py -u 'https://drive.google.com/open?id=1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-' -f 'models/flowtron_ljs.pt'
!python gfile.py -u 'https://drive.google.com/open?id=1Rm5rV5XaWWiUbIpg5385l5sh68z2bVOE' -f 'models/waveglow_256channels_v4.pt'
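
Downloads from Google Drive can fail quietly and leave a missing or tiny file behind, which only shows up later as a confusing error at inference time. A minimal sanity check could look like the sketch below. `find_bad_downloads` is a hypothetical helper and the 1 MB threshold is an arbitrary assumption, not a value taken from the Flowtron repository.

```python
import os


def find_bad_downloads(paths, min_bytes=1_000_000):
    """Return the paths that are missing or smaller than `min_bytes`."""
    bad = []
    for path in paths:
        # A checkpoint that is absent or suspiciously small
        # most likely did not download correctly.
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            bad.append(path)
    return bad
```

Calling it with the three paths passed to `-f` above and checking that the returned list is empty would catch a broken download before running inference.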

Inference Demo

In [None]:
%cd /content/flowtron
tts_text = response.replace('\n',' ').replace('"','')
print(tts_text)
!source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "$tts_text" -i 0

!cp './results/sid0_sigma0.5.wav' ./..
%cd ..
!mv './sid0_sigma0.5.wav' './speech.wav'

from IPython.display import Audio
# The working directory is now /content, where the audio was saved as speech.wav
sound_file = './speech.wav'
Audio(sound_file, autoplay=True)
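
The cell above strips double quotes from the text before passing it to the shell as `-t "$tts_text"`, but characters such as backticks or `$` can still be interpreted by the shell. One way to sidestep this is to quote the text with the standard library's `shlex.quote`. This is only a sketch: `shell_safe` is a hypothetical helper, and how the quoted value interacts with Colab's `!` interpolation should be checked in the notebook itself.

```python
import shlex


def shell_safe(text):
    """Collapse whitespace and quote `text` so a POSIX shell sees one argument."""
    flat = " ".join(text.split())  # newlines and tabs become single spaces
    return shlex.quote(flat)       # wraps in single quotes, escaping as needed
```

The quoted string already carries its own quotes, so it would be interpolated without extra ones, for example `-t {quoted}` rather than `-t "$tts_text"`, where `quoted = shell_safe(response)`.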

# Step 4: Create talking head video with LipGAN

In [None]:
%cd /content
!git clone https://github.com/Rudrabha/LipGAN.git   --branch fully_pythonic --single-branch
%cd LipGAN

In [None]:
!pip install git+https://www.github.com/keras-team/keras-contrib.git; pip uninstall -y tensorflow tensorflow-gpu; pip install -U numpy; pip install tensorflow-gpu==1.14.0; pip install -U scipy

Download the pre-trained LipGAN model and the Face Detector file

In [None]:
!wget -N  -q https://raw.githubusercontent.com/yhgon/colab_utils/master/gfile.py
!python gfile.py -u 'https://drive.google.com/open?id=1DtXY5Ei_V6QjrLwfe7YDrmbSCDu6iru1' -f './logs/lipgan_residual_mel.h5'
!wget 'http://dlib.net/files/mmod_human_face_detector.dat.bz2' -P './logs/'
!bunzip2 './logs/mmod_human_face_detector.dat.bz2'

In [None]:
%cd /content/LipGAN
!python batch_inference.py --checkpoint_path logs/lipgan_residual_mel.h5 --model residual --face "/content/person.jpg" --audio /content/speech.wav --results_dir /content

!ffmpeg -i /content/result_voice.avi /content/result_voice.mp4
from IPython.display import HTML
from base64 import b64encode
mp4 = open('/content/result_voice.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""<video controls><source src="%s" type="video/mp4"></video>""" % data_url)

# Now try it out yourself
Execute the following code cell, and this time type the question yourself at the text prompt. Save the previous results before running again, as they will be overwritten.
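
One way to save the previous results is to copy them into a timestamped folder before rerunning. The sketch below assumes the outputs live in `/content` under the file names used in this notebook. `archive_results` and the `archive_root` location are hypothetical, not something the notebook defines.

```python
import os
import shutil
import time


def archive_results(src_dir, names, archive_root):
    """Copy the existing files in `names` from `src_dir` into a timestamped folder.

    Returns the destination folder and the list of file names actually copied.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(archive_root, stamp)
    copied = []
    for name in names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            # Create the folder lazily so an empty run leaves nothing behind.
            os.makedirs(dest, exist_ok=True)
            shutil.copy2(src, os.path.join(dest, name))
            copied.append(name)
    return dest, copied
```

For example, `archive_results('/content', ['person.jpg', 'speech.wav', 'result_voice.mp4'], '/content/archive')` would keep one copy per run.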

In [None]:
%cd /content/
!rm -f speech.wav person.jpg result.avi result_voice.avi result_voice.mp4
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.get("https://thispersondoesnotexist.com/")
import time
time.sleep(5)
button = driver.find_element_by_id('saveButton')
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).move_to_element(button).click(button).perform()
time.sleep(4)

from PIL import Image, ImageOps
original_image = Image.open("person.jpg")
size = (256,256)
resized_image = ImageOps.fit(original_image, size, Image.ANTIALIAS)
image = resized_image.convert('RGB')
image.save("person.jpg")

prompt = input("Ask a question: ")

driver.get("http://textsynth.org/")
driver.implicitly_wait(10)
inputElement = driver.find_element_by_id('input_text')
inputElement.click()
inputElement.clear()
inputElement.send_keys(prompt)
button = driver.find_element_by_id('submit_button')
ActionChains(driver).move_to_element(button).click(button).perform()
time.sleep(10)
responseElement = driver.find_element_by_id('gtext')
response = responseElement.text
response = response[len(prompt):].replace('\n', ' ')

%cd /content/flowtron
tts_text = response.replace('\n',' ').replace('"','')
!source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "$tts_text" -i 0

!cp './results/sid0_sigma0.5.wav' ./..
%cd ..
!mv './sid0_sigma0.5.wav' './speech.wav'

%cd /content/LipGAN
!python batch_inference.py --checkpoint_path logs/lipgan_residual_mel.h5 --model residual --face "/content/person.jpg" --audio /content/speech.wav --results_dir /content

!ffmpeg -i /content/result_voice.avi /content/result_voice.mp4
from IPython.display import HTML
from base64 import b64encode
mp4 = open('/content/result_voice.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""<video controls><source src="%s" type="video/mp4"></video>""" % data_url)

# Experimental Code (ignore)
Details: split the input text into chunks that fit within Flowtron's token limit so that no part of the audio is dropped.
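
A plain character-based wrap can cut a sentence in the middle, which may produce an audible break between chunks. As a rough sketch, the text could instead be split at sentence boundaries first, falling back to word wrapping only for sentences that are too long. `chunk_text` is a hypothetical helper, and its `limit` is measured in characters, which is only a stand-in for Flowtron's actual token count.

```python
import re
import textwrap


def chunk_text(text, limit=80):
    """Split `text` into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate  # the sentence still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(sentence) <= limit:
            current = sentence
        else:
            # A single sentence longer than the limit: fall back to word wrapping.
            pieces = textwrap.wrap(sentence, limit)
            chunks.extend(pieces[:-1])
            current = pieces[-1]
    if current:
        chunks.append(current)
    return chunks
```

The returned list could be iterated the same way as a `textwrap.wrap` result, one inference call per chunk.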

In [None]:
%cd /content/flowtron/
import librosa
import numpy as np
import textwrap

response = 'As a health expert, I predict that the pandemic will happen in the 2020s, and it is possible that it may happen in the 2050s. What I think will happen is that pandemic will cause an increase in deaths from diseases that have been eradicated by vaccines or by conventional medicine. Will there still be any deaths from the new viral pathogens? Yes, we will still have many deaths from the new viruses.'
tts_text = response.replace('\n',' ').replace('\'','').replace('\"','')
token_limit = 80
tts_text_wrap = textwrap.wrap(tts_text, token_limit)

for it, tts_input in enumerate(tts_text_wrap):
  print(tts_input)
  
  # TODO: Of course we want to load the model only once for multiple inferences, but the virtualenv session gets 
  # deactivated after every line in Colab, so developing a workaround for that would require some work.
  !source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "$tts_input" -i 0
  !mv './results/sid0_sigma0.5.wav' './results/speech{it}.wav'
  # if it:  
  #   x, sr = librosa.load('./results/speech.wav')
  #   y, sr = librosa.load('./results/sid0_sigma0.5.wav')
  #   z = np.append(x,y)
  #   librosa.output.write_wav('./results/speech.wav', z, sr)
  # else:
  #   !mv './results/sid0_sigma0.5.wav' './results/speech.wav'
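
To finish the idea in the commented-out code above, the per-chunk files (`speech0.wav`, `speech1.wav`, ...) still need to be joined into one file. Below is a minimal sketch using only the standard library's `wave` module. It assumes the chunks are plain PCM WAV files with identical channel count, sample width and sample rate. The `wave` module rejects other encodings such as floating-point WAV, so if Flowtron writes those, an audio library would be needed instead. `concat_wavs` is a hypothetical helper.

```python
import wave


def concat_wavs(input_paths, output_path):
    """Concatenate PCM WAV files that share channels, sample width and sample rate."""
    params = None
    frames = []
    for path in input_paths:
        with wave.open(path, "rb") as reader:
            current = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if params is None:
                params = current  # the first file defines the expected format
            elif current != params:
                raise ValueError(f"{path} has a different audio format: {current} != {params}")
            frames.append(reader.readframes(reader.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(output_path, "wb") as writer:
        writer.setnchannels(params[0])
        writer.setsampwidth(params[1])
        writer.setframerate(params[2])
        for chunk in frames:
            writer.writeframes(chunk)
```

For the loop above, the inputs would be the `./results/speech{it}.wav` files in order, and the output would be the single `speech.wav` that the LipGAN step expects.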