# Text to sfx

The goal is to take a string input (max 4 lines) and return reasonable sound effects to go with the text. 

For my first attempt I will go with something fairly simple.

I have a whole bunch of sound effect files, each with a descriptive title. I convert those titles into word2vec vectors and do the same for the input text. Then I compute the cosine similarity between the input vector and each title vector to find which titles are most similar to the input text. 

I also want 2 sound effect files to play, so one more thing I need to make sure of is that these two sound effect files aren't too similar to each other. I can make sure the files are different enough by clustering the most similar vectors and looking for the top two recommendations from different clusters. 
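
As a rough sketch of the retrieval idea described above, the snippet below ranks a few sound effect titles against an input vector by cosine similarity. The 3-d vectors, the file names and the helper `cosine_similarity_matrix` are made up for illustration; the real notebook uses word2vec vectors instead.

```python
import numpy as np

def cosine_similarity_matrix(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    matrix = np.asarray(matrix, dtype=float)
    vector = np.asarray(vector, dtype=float)
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ vector / (row_norms * np.linalg.norm(vector))

# Toy 3-d "embeddings" standing in for word2vec title vectors (invented values).
toy_titles = {
    "crowd_cheering.wav": [0.9, 0.1, 0.0],
    "rain_on_roof.wav": [0.0, 0.2, 0.9],
    "stadium_applause.wav": [0.8, 0.3, 0.1],
}
# Pretend this is the summed vector of the input text.
toy_input = [1.0, 0.2, 0.0]

names = list(toy_titles)
scores = cosine_similarity_matrix(list(toy_titles.values()), toy_input)

# Sort titles from most to least similar to the input.
ranked = [names[i] for i in np.argsort(scores)[::-1]]
print(ranked)
```

Here the two crowd-like titles rank above the rain title, which is exactly the near-duplicate problem that the clustering step is meant to handle.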




In [1]:

import numpy as np
import os
from gensim.models.keyedvectors import KeyedVectors
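import sklearn
import sklearn.metrics.pairwise  # imported explicitly because cosine_similarity is used further down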
from sklearn.cluster import AgglomerativeClustering,DBSCAN

from my_text_process_scripts import *
import json

data_folder="../data/"
tags_folder="../sound_effects/tags"
sound_folder="../sound_effects/sounds/"




# diary entry

In [2]:


input_text="""I just moved to New York, so it’s been a little tough meeting people. It seems like everyone already has their own group of friends. So I’m trying to become more of a ‘yes’ person and do things I normally wouldn’t do. Like I came to the park today instead of sitting at home. And I went to my first hockey game yesterday. And I joined a dodgeball team on Thursday nights. Dodgeball is a lot more pressure than I thought it would be. I try to hang back and not throw the ball, but then usually I’m the last one and everyone is aiming at me. The only consolation is knowing that it’s going to be over in two seconds. And after the game we all go to the bar. Our team name is ‘We Throw Things and Drink.’” 
"""

# How it works for sound effect retrieval


I have a dictionary of sound effect files and their w2v vectors. I was hoping to just convert the input text to a vector and then see which sfx vector it was most similar to.

One thing I've noticed though is that nouns are very important. 

Case 1: 
A guy goes into a bar with friends to watch sports. 

The sfx with high similarity is ["baby","happy","friendly"].
Two of those words sort of match the setting, but "baby" definitely does not. 
So how shall I deal with this?

I need to determine which words are nouns. 
I can use spaCy for the input text.
Would spaCy work for the flags? They are not in sentence form. 


Then I need to weigh it. 
I can find the similarity word for word, which gives a cosine score (between -1 and 1, and usually positive for related words) that I could use to weigh the rest.

Compare the flag noun vector with the summed-up input vector. 
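
One way the weighting idea above could look, as a sketch only: each noun in a sound effect's word list is weighted by its cosine similarity to the summed input vector (clipped to the range 0 to 1), and all other words keep a weight of 1. The function `weighted_tag_vector`, the 2-d toy vectors and the hard-coded noun set are assumptions for illustration; in practice the nouns would come from a part-of-speech tagger such as spaCy.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_tag_vector(tag_words, noun_words, word_vectors, input_vector):
    """Sum the word vectors, down-weighting nouns that do not fit the input.

    Hypothetical scheme: a noun's weight is its cosine similarity with the
    input vector, clipped to [0, 1]; every other word keeps weight 1.
    """
    total = np.zeros_like(np.asarray(input_vector, dtype=float))
    for word in tag_words:
        vec = np.asarray(word_vectors[word], dtype=float)
        weight = 1.0
        if word in noun_words:
            weight = min(1.0, max(0.0, cosine(vec, input_vector)))
        total += weight * vec
    return total

# Invented 2-d vectors: "baby" points away from the bar-and-sports input.
toy_vectors = {
    "baby": [0.0, 1.0],
    "happy": [1.0, 0.0],
    "friendly": [0.9, 0.1],
}
toy_input = [1.0, 0.0]
tags = ["baby", "happy", "friendly"]

plain = np.sum([toy_vectors[w] for w in tags], axis=0)
weighted = weighted_tag_vector(tags, {"baby"}, toy_vectors, toy_input)

print(cosine(plain, toy_input), cosine(weighted, toy_input))
```

With these toy values the mismatched noun contributes nothing, so a sound effect is only rewarded for nouns that actually fit the input text.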




In [3]:
import pickle

def load_obj(name ):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
    
wav_to_vec_dict=load_obj("wav_to_vec_dict")

# loading the word2vec model


In [4]:

my_w2v=KeyedVectors.load_word2vec_format(os.path.join(data_folder,'GoogleNews-vectors-normed2.bin'),binary=True, encoding='utf-8', unicode_errors='ignore')




# Turning my w2v vectors into a numpy matrix

In [5]:

filename_dict = dict(enumerate(wav_to_vec_dict.keys()))

sent_dict = dict(enumerate(wav_to_vec_dict.values()))

sent_matrix=np.array(list(sent_dict.values()))



# Replace any nan values with the mean in the matrix

In [7]:
def nan_to_mean(sent_matrix):

    col_mean = np.nanmean(sent_matrix, axis=0)
    inds = np.where(np.isnan(sent_matrix))
    sent_matrix[inds] = np.take(col_mean, inds[1])
    
    return sent_matrix

sent_matrix=nan_to_mean(sent_matrix)


In [8]:

def vecSim(vec1,vec2):
    answer=sklearn.metrics.pairwise.cosine_similarity(vec1.reshape(1,-1),vec2.reshape(1,-1))
    
    return answer[0][0]


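def wordlist2vec(words_list,word_vectors):
    # sum the vectors of every word that the model knows
    final_results=[]
    for word in words_list:
        try:
            if len(final_results)==0:
                final_results=word_vectors[word]
            else:
                final_results=final_results+word_vectors[word]
        except KeyError:
            # only skip out-of-vocabulary words; a bare except would hide every other error
            print("word not found: "+word)
        
    return final_results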


tokenized_resulting_string=tokenize_process(input_text)


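words_in_voc=[]
# keep only the words that the word2vec model knows
for each_word in tokenized_resulting_string:
    # KeyedVectors supports `in` directly; the `.vocab` attribute was removed in gensim 4
    if each_word in my_w2v:
        words_in_voc.append(each_word)

if len(words_in_voc)!=0:
    vectorized_input_text=wordlist2vec(words_in_voc,my_w2v)
else:
    # without this, the next cell would fail with an unclear NameError
    raise ValueError("none of the input words are in the word2vec vocabulary")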
    


In [9]:
sent_similarities=np.apply_along_axis(vecSim, 1, sent_matrix,vec2=vectorized_input_text)


In [10]:
adj_sent_similarities=sent_similarities


In [11]:
# keep the top 30 for now
# in the future I might need some kind of minimum similarity 

#either just keep a certain number
best_match_sort=np.argsort(adj_sent_similarities)[::-1][:30]
adj_sent_similarities[best_match_sort]


array([0.653071  , 0.6517482 , 0.6375183 , 0.6016448 , 0.5872398 ,
       0.57675105, 0.56047845, 0.5595844 , 0.545179  , 0.53961396,
       0.52695936, 0.5266143 , 0.5234463 , 0.5234463 , 0.5220766 ,
       0.5147978 , 0.51438665, 0.51360714, 0.5087781 , 0.5073549 ,
       0.50308156, 0.5001439 , 0.49987566, 0.49346298, 0.4928645 ,
       0.49256867, 0.4925185 , 0.4921887 , 0.4921033 , 0.49114478],
      dtype=float32)
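
The "minimum similarity" idea mentioned in the comments could be combined with the top-N cut roughly as follows. This is a sketch: the helper `top_matches`, the toy scores and the 0.5 threshold are assumptions, and a sensible threshold would have to be tuned on real data.

```python
import numpy as np

def top_matches(similarities, max_results=30, min_similarity=0.5):
    """Indexes of the best matches, best first, dropping weak ones.

    Keeps at most `max_results` entries and only those whose score is
    at least `min_similarity` (an arbitrary, untuned cut-off).
    """
    similarities = np.asarray(similarities)
    order = np.argsort(similarities)[::-1][:max_results]
    return order[similarities[order] >= min_similarity]

toy_scores = np.array([0.65, 0.12, 0.58, 0.49, 0.51])
kept = top_matches(toy_scores, max_results=3, min_similarity=0.5)
print(kept)
```

If no score reaches the threshold the result is an empty array, so the caller can decide to play no sound effect at all rather than a poor match.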

In [12]:

#find out how well all sound effect titles compare to all others.
sfx_similarities=sklearn.metrics.pairwise.cosine_similarity(sent_matrix[best_match_sort],sent_matrix[best_match_sort])


# Clustering

The reason for the clustering is that I have a lot of similar sound effects (example: walking.wav and man_walking.wav), and I don't want to just retrieve two similar sound effects. By clustering them, there's a better chance that sound effects from different clusters are actually different. 
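
To make the idea concrete, here is a small sketch with invented 2-d vectors and file names, already sorted best match first. It clusters the rows of the pairwise cosine similarity matrix and keeps the best-ranked candidate from each cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Candidates sorted best match first (toy names and vectors, not real data).
toy_names = ["walking.wav", "man_walking.wav", "crowd_cheer.wav", "footsteps.wav"]
toy_vectors = np.array([
    [1.00, 0.00],
    [0.98, 0.05],
    [0.05, 1.00],
    [0.95, 0.10],
])

# Pairwise cosine similarity: normalise the rows, then take dot products.
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_similarities = unit @ unit.T

# Each row of the similarity matrix is used as the feature vector of a candidate.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(toy_similarities)

# Walk the candidates in ranked order and keep the first one seen per cluster.
picked = []
seen = set()
for idx, label in enumerate(labels):
    if label not in seen:
        seen.add(label)
        picked.append(toy_names[idx])

print(picked)
```

The three walking-like files end up in one cluster, so only the best of them is kept next to the crowd sound.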


In [14]:
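#how many sfx I want in my video
nber_sfx=2

# each row of the similarity matrix is used as the feature vector of one candidate
X = sfx_similarities
clustering = AgglomerativeClustering(n_clusters=nber_sfx).fit_predict(X)

unique_clusters=np.unique(clustering)

indexes=[]
for each_cluster in unique_clusters:
    # candidates are sorted best first, so the first member of a cluster is its best match
    indexes.append(np.where(clustering==each_cluster)[0][0])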
    
#retrieving indexes for files
sfx_index=best_match_sort[indexes]
#retrieving sfx file names
sfx_files_to_play=[filename_dict[x] for x in sfx_index]

sfx_files_to_play

['House_home\\25_Match_Struck_spark_start_fire.wav',
 'America\\sports_crowd_excited_fans_soccer_football_outside_score_happy.wav']