# Face to Face Conversation System using Deep Neural Networks

The project follows the steps below, carried out by Rahul Sharma with reference to the research paper.

1. Create a dummy dataset (collecting the original dataset would take a lot of time)
2. Create an encoder-decoder architecture for the speaking and listening models
3. Implement the conversational model
4. Test the speaking and listening models
5. Take the output of the speaking and listening models
6. Implement Pix2Pix HD GAN to get the final output

Additional tests will be carried out between these steps as the project proceeds.

## 1. Creation of a dummy dataset

The dummy dataset needs the facial action units (AUs) and face pose of both the listener and the speaker. Each is a 20-dimensional vector (3 head-pose rotations plus 17 AU intensities) in the format of the OpenFace software's output; the dummy values are generated arbitrarily within the ranges observed in a real OpenFace output.

!CAUTION!: the data used in this notebook is arbitrarily generated. The final data will come from real videos of conversations.

In [1]:
#importing the required modules
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import random
from sklearn import preprocessing

pathToData = "./videoplayback.csv"
rawData    = pd.read_csv(pathToData)
columns    = list(rawData.columns)
rawData

Unnamed: 0,frame,face_id,timestamp,pose_Rx,pose_Ry,pose_Rz,AU01_r,AU02_r,AU04_r,AU05_r,...,AU12_c,AU14_c,AU15_c,AU17_c,AU20_c,AU23_c,AU25_c,AU26_c,AU28_c,AU45_c
0,1,0,0.000,-0.008,0.413,0.112,1.55,0.00,0.41,0.0,...,0,0,0,1,1,1,0,0,0,1
1,2,0,0.033,-0.018,0.399,0.124,1.58,0.00,0.36,0.0,...,0,0,0,1,1,1,0,0,0,1
2,3,0,0.067,-0.020,0.397,0.133,1.55,0.00,0.34,0.0,...,0,0,0,1,1,1,0,0,0,1
3,4,0,0.100,-0.026,0.396,0.144,1.42,0.00,0.37,0.0,...,0,0,0,1,1,1,0,0,0,1
4,5,0,0.133,-0.031,0.397,0.155,1.24,0.00,0.39,0.0,...,0,0,0,1,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2431,2432,0,81.114,-0.206,-0.859,-0.321,0.87,0.74,0.21,0.0,...,0,1,1,0,0,1,0,0,0,1
2432,2433,0,81.148,-0.191,-0.770,-0.282,0.67,0.45,0.00,0.0,...,1,1,1,0,0,1,0,0,0,1
2433,2434,0,81.181,-0.125,-0.727,-0.246,0.54,0.23,0.00,0.0,...,1,1,1,0,0,1,0,0,0,1
2434,2435,0,81.214,-0.068,-0.708,-0.214,0.50,0.00,0.00,0.0,...,1,1,1,0,0,1,0,0,0,0


In [12]:
#taking the columns where we have the Pose and 17 action units (regression)
filterColumns = columns[3:]
filterColumns = [i for i in filterColumns if "_r" in i or "pose" in i]
filterData    = rawData[filterColumns]
filterData

Unnamed: 0,pose_Rx,pose_Ry,pose_Rz,AU01_r,AU02_r,AU04_r,AU05_r,AU06_r,AU07_r,AU09_r,AU10_r,AU12_r,AU14_r,AU15_r,AU17_r,AU20_r,AU23_r,AU25_r,AU26_r,AU45_r
0,-0.008,0.413,0.112,1.55,0.00,0.41,0.0,0.08,0.00,0.56,0.19,0.00,0.35,0.00,2.11,0.44,0.00,0.00,0.00,0.74
1,-0.018,0.399,0.124,1.58,0.00,0.36,0.0,0.03,0.00,0.55,0.19,0.00,0.36,0.00,2.03,0.41,0.00,0.00,0.00,0.57
2,-0.020,0.397,0.133,1.55,0.00,0.34,0.0,0.01,0.00,0.52,0.22,0.00,0.36,0.00,1.99,0.33,0.07,0.00,0.00,0.56
3,-0.026,0.396,0.144,1.42,0.00,0.37,0.0,0.00,0.00,0.53,0.20,0.00,0.28,0.00,2.05,0.30,0.16,0.00,0.00,0.50
4,-0.031,0.397,0.155,1.24,0.00,0.39,0.0,0.00,0.00,0.54,0.20,0.00,0.30,0.00,2.05,0.23,0.30,0.00,0.00,0.37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2431,-0.206,-0.859,-0.321,0.87,0.74,0.21,0.0,0.06,0.98,0.00,0.95,0.57,0.76,0.00,0.64,0.00,0.04,0.67,1.65,1.48
2432,-0.191,-0.770,-0.282,0.67,0.45,0.00,0.0,0.18,1.26,0.00,1.25,0.76,1.06,0.23,0.47,0.00,0.04,0.08,1.23,1.48
2433,-0.125,-0.727,-0.246,0.54,0.23,0.00,0.0,0.31,1.76,0.00,1.26,0.99,1.49,0.85,0.43,0.00,0.04,0.00,0.70,1.36
2434,-0.068,-0.708,-0.214,0.50,0.00,0.00,0.0,0.43,2.05,0.00,1.16,1.03,1.84,1.39,0.34,0.00,0.00,0.00,0.22,1.10
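
The selection above should leave 3 pose angles plus 17 AU intensity columns. A minimal sketch of a stricter selector with a dimension check follows; the helper name `select_feature_columns` and the prefix/suffix rules are assumptions for illustration, and the header list only reuses column names visible in the outputs above.

```python
def select_feature_columns(columns):
    """Return the pose-rotation columns followed by the AU intensity columns."""
    # Assumption: rotations are named pose_R*, intensities are named AU??_r.
    pose = [c for c in columns if c.startswith("pose_R")]
    au_intensity = [c for c in columns if c.startswith("AU") and c.endswith("_r")]
    return pose + au_intensity

# Header built from the column names shown in the outputs above.
au_ids = ["01", "02", "04", "05", "06", "07", "09", "10", "12",
          "14", "15", "17", "20", "23", "25", "26", "45"]
header = (["frame", "face_id", "timestamp", "pose_Rx", "pose_Ry", "pose_Rz"]
          + [f"AU{i}_r" for i in au_ids]
          + ["AU12_c", "AU28_c", "AU45_c"])

features = select_feature_columns(header)
assert len(features) == 20, "expected 3 pose angles + 17 AU intensities"
```

Matching on the start and end of each name, rather than on a substring, avoids picking up unrelated columns that merely contain `_r` or `pose` somewhere in their name.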


The filtered columns now hold the 20-dimensional vector that feeds the encoder-decoder architecture. Next, let's create the listener's dummy dataset so that we have X, Y pairs of data points for the Enc-Dec speaking/listening model.

In [20]:
#let's see some information about the data
filterData.describe()

Unnamed: 0,pose_Rx,pose_Ry,pose_Rz,AU01_r,AU02_r,AU04_r,AU05_r,AU06_r,AU07_r,AU09_r,AU10_r,AU12_r,AU14_r,AU15_r,AU17_r,AU20_r,AU23_r,AU25_r,AU26_r,AU45_r
count,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0,2436.0
mean,0.100613,-0.183204,-0.066628,0.503112,0.28672,0.446786,0.069557,0.434269,0.583543,0.14358,1.020205,0.51484,0.449503,0.973247,0.94624,0.165238,0.328354,0.798112,0.680447,0.18188
std,0.138948,0.418088,0.148935,0.776032,0.672829,0.688002,0.165909,0.422902,0.64136,0.284122,0.693222,0.720615,0.581334,1.399184,0.79578,0.316957,0.59971,0.711267,0.646511,0.305462
min,-0.225,-0.906,-0.537,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0225,-0.407,-0.133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.48,0.0,0.0,0.0,0.22,0.0,0.0,0.11,0.09,0.0
50%,0.097,-0.305,-0.078,0.0,0.0,0.09,0.0,0.36,0.34,0.0,0.93,0.13,0.19,0.12,0.73,0.0,0.0,0.66,0.515,0.0
75%,0.188,0.08525,0.022,0.98,0.09,0.58,0.0,0.74,1.03,0.12,1.5025,0.74,0.6925,1.6325,1.84,0.23,0.44,1.37,1.17,0.3
max,1.216,0.56,0.188,4.95,4.98,3.31,1.4,2.29,3.0,2.17,3.26,3.05,2.81,5.0,2.87,2.48,2.25,3.72,2.96,1.87


From the summary above we take the min and max of each column, and draw every dummy value uniformly at random from that column's range. The real dataset above is the input to the enc-dec model, and the dataset generated in this section is its output. Note that both the input and output datasets of the listening model have 20 dimensions.

In [36]:
n = len(filterData)  #one dummy row per frame of the real data (2436 here)
dummyData = pd.DataFrame(columns = filterColumns)
for colname in filterColumns:
    #sample uniformly between the observed min and max of this column
    maximum = filterData[colname].max()
    minimum = filterData[colname].min()
    valueList = []

    for i in range(n):
        randomValue = random.uniform(minimum, maximum)
        valueList.append(randomValue)
    
    dummyData[colname] = valueList
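
The loop above draws one value at a time. As an alternative, here is a minimal vectorized sketch with NumPy that draws the whole table in one call. The function name `make_dummy_like`, the fixed seed, and the tiny stand-in frame are assumptions for illustration; this is not the code used to produce `dummyData.csv`.

```python
import numpy as np
import pandas as pd

def make_dummy_like(reference: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Draw one uniform sample per cell, bounded by each column's min and max."""
    rng = np.random.default_rng(seed)  # seeded so the dummy data is reproducible
    low = reference.min().to_numpy(dtype=float)
    high = reference.max().to_numpy(dtype=float)
    # low and high broadcast across rows, so every column keeps its own range.
    values = rng.uniform(low, high, size=(len(reference), reference.shape[1]))
    return pd.DataFrame(values, columns=reference.columns)

# Tiny stand-in for filterData (values are illustrative only).
reference = pd.DataFrame({"pose_Rx": [-0.2, 0.1, 1.2], "AU01_r": [0.0, 1.5, 4.9]})
dummy = make_dummy_like(reference)
```

With the notebook's variables, the call would be `make_dummy_like(filterData)`. Seeding the generator means re-running the cell yields the same dummy table.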


In [39]:
#saving the dummyData
dummyData.to_csv("dummyData.csv")

The dummy data has now been created. Let's define what else we need for a basic implementation of this project.

The following steps remain:

1. Normalize the dataset
2. Build the X, Y dataset for training
3. Train a simple encoder-decoder architecture on the dataset
4. Build the listening and speaking models

In [10]:
#normalizing the dataset
'''
To normalize the dataset we will do the following
1. Take the data from videoplayback.csv as our X's
2. Take the data from dummyData.csv as our Y's
3. Using MinMax preprocessing, normalize the data so that the values
in each column lie in the range 0-1
4. Arrange the data as X, Y points and save
'''

X = pd.read_csv("videoplayback.csv")
Y = pd.read_csv("dummyData.csv")
columns = list(X)[3:]
columns = [i for i in columns if "_c" not in i]

#the columns have been filtered according to the 20 dimensional vector that is the input to the listening model
X = X[columns]
Y = Y[columns]

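#now let's normalize the data
x = X.values #numpy version
y = Y.values
#use one scaler per dataset so the parameters fitted on x are not overwritten by y
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()
x_scaled = x_scaler.fit_transform(x)
y_scaled = y_scaler.fit_transform(y)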

X = pd.DataFrame(columns = columns, data=x_scaled)
Y = pd.DataFrame(columns = columns, data=y_scaled)
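
To make the min-max step concrete: with the default 0-1 range, each column is mapped as `(x - min) / (max - min)`. The sketch below reproduces that formula in plain NumPy and shows how to map scaled values back to the original units, which will matter once model outputs need to be read as pose angles and AU intensities again. The function names and the sample array are assumptions for illustration.

```python
import numpy as np

def min_max_fit(x):
    """Return the per-column minimum and range needed to scale x to 0-1."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # A constant column has zero range; use 1 so it maps to 0 instead of NaN.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def min_max_apply(x, col_min, col_range):
    return (x - col_min) / col_range

def min_max_invert(x_scaled, col_min, col_range):
    return x_scaled * col_range + col_min

# Illustrative values only: two feature columns, three frames.
x = np.array([[-0.2, 0.0], [0.1, 2.5], [1.2, 5.0]])
col_min, col_range = min_max_fit(x)
x_scaled = min_max_apply(x, col_min, col_range)
x_back = min_max_invert(x_scaled, col_min, col_range)
```

Keeping the fitted minimum and range for each dataset separately is what makes the inverse mapping possible later.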


In [11]:
Y

Unnamed: 0,pose_Rx,pose_Ry,pose_Rz,AU01_r,AU02_r,AU04_r,AU05_r,AU06_r,AU07_r,AU09_r,AU10_r,AU12_r,AU14_r,AU15_r,AU17_r,AU20_r,AU23_r,AU25_r,AU26_r,AU45_r
0,0.568861,0.281822,0.311019,0.153453,0.365783,0.598496,0.261272,0.025992,0.344555,0.226159,0.803869,0.239406,0.055679,0.078044,0.053703,0.343496,0.931526,0.467989,0.547483,0.001282
1,0.894728,0.779545,0.209109,0.299008,0.567059,0.471438,0.989183,0.670331,0.838103,0.818745,0.526408,0.399676,0.520989,0.606002,0.488489,0.657296,0.411545,0.297295,0.823270,0.673423
2,0.912080,0.668611,0.508823,0.795814,0.733279,0.295811,0.830664,0.070189,0.316102,0.865085,0.739738,0.338281,0.938943,0.687085,0.532659,0.904625,0.072139,0.307150,0.507092,0.763870
3,0.996936,0.361594,0.572614,0.378407,0.278045,0.616579,0.554806,0.801454,0.038448,0.716251,0.855707,0.481987,0.114082,0.912565,0.958076,0.181587,0.824309,0.047485,0.420253,0.783537
4,0.200535,0.627136,0.713953,0.842567,0.543087,0.917862,0.448008,0.419037,0.908125,0.233131,0.381997,0.833802,0.241735,0.786558,0.437695,0.502995,0.552914,0.492943,0.058189,0.465894
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2431,0.877849,0.092813,0.951932,0.751462,0.232710,0.252750,0.111697,0.174082,0.762427,0.229644,0.589146,0.174012,0.385809,0.500893,0.088675,0.957217,0.219300,0.781094,0.999720,0.074695
2432,0.645355,0.508156,0.755032,0.624700,0.645140,0.647821,0.112859,0.812608,0.972481,0.234959,0.557780,0.715972,0.961083,0.256895,0.390652,0.254783,0.993307,0.557902,0.781266,0.221919
2433,0.964378,0.779269,0.175087,0.604718,0.831826,0.408323,0.182055,0.368927,0.582575,0.991417,0.475013,0.888221,0.089500,0.000611,0.058177,0.722295,0.727677,0.718220,0.381930,0.667552
2434,0.399542,0.929962,0.831572,0.867165,0.723411,0.684961,0.544399,0.412599,0.400675,0.341430,0.127711,0.345616,0.118401,0.203819,0.063236,0.083190,0.468195,0.320476,0.437134,0.488372
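
Step 4 of the normalization plan (arranging the data as X, Y points and saving) is not shown in the cells above. One possible arrangement, sketched here under assumptions, cuts the frame-aligned X and Y arrays into fixed-length windows so that an encoder-decoder can consume sequences. The window length, the step, the helper name `make_windows`, and the output file name are all hypothetical; random arrays stand in for `X.values` and `Y.values`.

```python
import numpy as np

def make_windows(x, y, seq_len, step=1):
    """Cut two frame-aligned arrays into matching windows of seq_len frames."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of frames")
    if seq_len > len(x):
        raise ValueError("seq_len is longer than the sequence")
    starts = range(0, len(x) - seq_len + 1, step)
    x_win = np.stack([x[s:s + seq_len] for s in starts])
    y_win = np.stack([y[s:s + seq_len] for s in starts])
    return x_win, y_win

# Stand-ins for the normalized X.values and Y.values (100 frames, 20 features).
rng = np.random.default_rng(0)
x_demo = rng.random((100, 20))
y_demo = rng.random((100, 20))

x_win, y_win = make_windows(x_demo, y_demo, seq_len=30, step=10)
# Both arrays now have shape (num_windows, seq_len, 20).

# To persist the pairs (hypothetical file name):
# np.savez("xy_pairs.npz", x=x_win, y=y_win)
```

Window `i` of `x_win` and window `i` of `y_win` cover the same frames, so each pair stays aligned in time.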
