Using a voiceprint plus a secret spoken phrase (the same recording supplies both the voiceprint and the transcribed text), we can derive a private key that is deterministic and unique to you.

In [22]:
!pip install deterministic-rsa-keygen mtcnn matplotlib scipy librosa numpy pocketsphinx SpeechRecognition pydub

import urllib.request
data_dir="https://raw.githubusercontent.com/TBD54566975/experimental-face-voice-key/main/data"
def download_file(filename):
    urllib.request.urlretrieve(data_dir + "/" + filename, filename)

download_file("test1.jpg")
download_file("test2.jpg")
download_file("voice_mic1.m4a")
download_file("voice_mic2.m4a")
download_file("voice_mic3.m4a")

download_file("voice_mic4.m4a")
download_file("voice_jo1.m4a")
download_file("oli_mic1.m4a")
download_file("oli_mic2.m4a")


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Voice print

Everyone has a reasonably unique voiceprint that is hard to spoof, especially when combined with a secret phrase. librosa provides some simple utilities to calculate one. Here we use linear predictive coding (https://en.wikipedia.org/wiki/Linear_predictive_coding) to produce a fingerprint that tolerates variation between utterances. On its own it is not secure against replay, so it needs to be combined with a spoken secret.

In [23]:
import librosa
import numpy as np

def calculate_voiceprint(audio_file, num_coeffs=200):
  # Load the audio (librosa resamples to 22050 Hz mono by default) and
  # calculate the linear predictive coefficients (LPCs) for the signal.
  # `order` is passed by keyword because newer librosa releases require it.
  audio, sr = librosa.load(audio_file)
  lpcs = librosa.lpc(audio, order=num_coeffs)

  def round_vector(vector, precision):
    # Snap each element to the nearest multiple of a step size.
    rounded_vector = []
    for i in range(len(vector)):
      element = vector[i]
      rounded_element = round(element / precision) * precision
      rounded_vector.append(rounded_element)
      precision += 0.2  # Widen the step after each element, so later coefficients are rounded more coarsely
    return rounded_vector

  # Keep only the first six quantized coefficients as the voiceprint.
  return round_vector(lpcs, 0.5)[:6]


Let's try it out on a few voice files.



In [24]:

mic1 = calculate_voiceprint("voice_mic1.m4a")
mic2 = calculate_voiceprint("voice_mic2.m4a")

mic3  = calculate_voiceprint("voice_mic3.m4a")
oli_mic1 = calculate_voiceprint("oli_mic1.m4a")
oli_mic2 = calculate_voiceprint("oli_mic2.m4a")

not_mic = calculate_voiceprint("voice_mic4.m4a")

jo = calculate_voiceprint("voice_jo1.m4a")

print("                    mic1", mic1)
print("                    mic2", mic2)
print("                    mic3", mic3)
print("Mic but different phrase", not_mic)
print("  Oli speaking like mic1", oli_mic1)
print("  Oli speaking like mic2", oli_mic2)
print("                      jo", jo)




                    mic1 [1.0, -2.0999999999999996, 2.6999999999999997, -3.3, 5.199999999999999, -5.999999999999999]
                    mic2 [1.0, -2.0999999999999996, 2.6999999999999997, -3.3, 5.199999999999999, -5.999999999999999]
                    mic3 [1.0, -2.0999999999999996, 2.6999999999999997, -3.3, 5.199999999999999, -5.999999999999999]
Mic but different phrase [1.0, -1.4, 1.7999999999999998, -2.1999999999999997, 2.5999999999999996, -2.9999999999999996]
  Oli speaking like mic1 [1.0, -2.0999999999999996, 1.7999999999999998, -2.1999999999999997, 2.5999999999999996, -2.9999999999999996]
  Oli speaking like mic2 [1.0, -1.4, 0.8999999999999999, -1.0999999999999999, 1.2999999999999998, -2.9999999999999996]
                      jo [1.0, -2.0999999999999996, 2.6999999999999997, -4.3999999999999995, 5.199999999999999, -7.499999999999999]
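
The matching rows above come from the rounding step: each coefficient is snapped to a multiple of a step size that starts at 0.5 and grows by 0.2 per position. The following is a minimal sketch of that step in isolation, with no audio involved. The helper name `quantize` and the two input vectors are made up for illustration and are not taken from the recordings.

```python
def quantize(vector, step=0.5, growth=0.2, keep=6):
    """Snap each of the first `keep` values to a multiple of a growing step."""
    out = []
    for value in vector[:keep]:
        out.append(round(value / step) * step)
        step += growth  # later positions use a coarser grid
    return out

# Two hypothetical, slightly different coefficient vectors.
take_a = [1.0, -2.0, 2.6, -3.4, 5.0, -6.2]
take_b = [1.0, -2.2, 2.8, -3.2, 5.3, -5.9]
# A hypothetical vector that is further away.
other = [1.0, -1.4, 1.8, -2.2, 2.6, -3.0]

print(quantize(take_a) == quantize(take_b))  # small differences collapse
print(quantize(take_a) == quantize(other))   # larger differences survive
```

The trade-off is the usual one for this kind of quantization: wider steps make repeated takes by the same speaker agree more often, but also make it more likely that a different speaker lands on the same values.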


# Voice to text

Here is some rudimentary voice-to-text to provide some extra signal.

In [25]:
def voice_text(audio_file):
  import speech_recognition as sr
  from pydub import AudioSegment

  audio = AudioSegment.from_file(audio_file, format="m4a")
  raw_data = audio.raw_data
  audio_data = sr.AudioData(raw_data, audio.frame_rate, audio.sample_width)


  r = sr.Recognizer()
  text = r.recognize_sphinx(audio_data)
  print("text detected: " + text)
  return text
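
If the transcript is ever folded into the seed, it has to be byte-for-byte identical between takes, so differences in case, punctuation, or spacing would matter. A minimal sketch of one way to normalize a transcript first is shown here. The helper `normalize_transcript` is hypothetical and not part of this notebook.

```python
import re

def normalize_transcript(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z0-9\s]", "", lowered)
    return " ".join(letters_only.split())

print(normalize_transcript("  Hello,   WORLD! "))
```

Normalization only removes formatting differences. It does not help when the recognizer hears different words in two takes.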

# Combine into deterministic seed

In [31]:
def make_seed(voice_file):
  return str(calculate_voiceprint(voice_file)) #+ voice_text(voice_file)
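
Because the seed is just the string form of the voiceprint list, two recordings yield the same seed only when every quantized value matches exactly. The sketch below illustrates that property. The helper `seed_digest` and the SHA-256 hashing step are assumptions added for illustration; the notebook passes the seed string straight to `generate_key`.

```python
import hashlib

def seed_digest(voiceprint):
    """Hash the string form of a voiceprint into a fixed-length hex digest."""
    seed = str(list(voiceprint))
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()

# Hypothetical voiceprints: the first two are identical, the third differs in one value.
same_a = [1.0, -2.1, 2.7]
same_b = [1.0, -2.1, 2.7]
different = [1.0, -2.1, 1.8]

print(seed_digest(same_a) == seed_digest(same_b))
print(seed_digest(same_a) == seed_digest(different))
```

A single differing coefficient changes the whole seed, which is why the voiceprint is rounded so aggressively before it is used.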

# Encrypt with *voice*

Use the deterministic seed to create a private key.

In [32]:
from rsa import generate_key, encrypt, decrypt

secret_key = generate_key(make_seed("voice_mic1.m4a"))

public_key = secret_key.publickey().exportKey("PEM")

# eg round trip:
secret = encrypt("Hello World using voice as key", public_key)

print(secret)




b'OV2M0Phe4wLRShsWbDdHUYkjf6f/JWXZwpXRD4hR3PnUpwcZiGFVUwLmyeaPFyQlVKcG7QFl/8x1Tnm5TJGrn/4P+TwWHLEehbkImomyeOXjjUf/2XLJQU/47ova97eqw9xRKC1Ac68L9pB6okPMcu/Zu1KrOCscCtYPO+SPcYFKcyVxO37nflY4+Xj32BvDoK04ks6jOiSZFRQ9MwbAI3dc1p5aaqxfJma9MQhwm8XNLQBhvKOMwp+YcGTlrmO26q5C0hbr+ZK76FiPNV5WQ/4szVsM/lnN1yPMoDr07SIbP4v0+fcwWpqfrrF0DV61PAMPZvfvW641wjv9OVXcfA=='


Now we will use a different voice recording to check that we can regenerate the same key and then decrypt.

In [33]:

# using the other recording we can make the same key
secret_key = generate_key(make_seed("voice_mic2.m4a"))

private_key = secret_key.exportKey("PEM")

# and we get the secret back (and can use alternative audio if we are clear enough)
decrypt(secret, private_key)





b'Hello World using voice as key'