# Feature extraction
## Adrián Arnaiz Rodríguez

**In this notebook we extract the different types of features from the different types of audio.**

Approximate running time: 40 minutes

<a id="index"></a>
## Table of contents

0. [Introduction](#intro)
1. [Extracting phonation measures from voiced frames](#fona)
    >- [Sentence](#fonfra): saved in the numpy variable `fon_rt_ccas`
    >- [Words](#fonpal): saved in the numpy variables `fon_w_palabra_ccas` (one per word)
    >- [Vowels](#fonvoc): saved in the numpy variables `fon_v_vocal_ccas` (one per vowel)


2. [Extracting articulation measures from transitions](#arti)
    >- [Sentence](#artifra): saved in the numpy variable `art_rt_ccas`
    >- [Words](#artipal): saved in the numpy variables `art_w_palabra_ccas` (one per word)



3. [Extracting prosodic measures from complete audios](#proso)
    >- [Sentence](#prosofra): saved in the numpy variable `prs_rt_ccas`

4. [Data cleaning](#limp)

## Glossary of variables
#### EACH VARIABLE IS SAVED IN THE CaracteristicasExtraidas FOLDER IN NUMPY FORMAT.
* **fon_rt_ccas** : numpy matrix of features and target for the phonation features of the sentence.
* **fon_words_ccas** : dictionary holding the phonation data (numpy matrix and DataFrame) for each word
 >* ***fon_w_atleta_ccas*** : numpy matrix of features and target for the phonation features of the word atleta
 * ***fon_w_campana_ccas*** : numpy matrix of features and target for the phonation features of the word campana
 * ***fon_w_braso_ccas*** : numpy matrix of features and target for the phonation features of the word braso
 * ***fon_w_gato_ccas*** : numpy matrix of features and target for the phonation features of the word gato
 * ***fon_w_petaka_ccas*** : numpy matrix of features and target for the phonation features of the word petaka
* **fon_vowels_ccas** : dictionary holding the phonation data (numpy matrix and DataFrame) for each vowel
 >* ***fon_v_A_ccas*** : numpy matrix of features and target for the phonation features of the vowel A
 * ***fon_v_E_ccas*** : numpy matrix of features and target for the phonation features of the vowel E
 * ***fon_v_I_ccas*** : numpy matrix of features and target for the phonation features of the vowel I
 * ***fon_v_O_ccas*** : numpy matrix of features and target for the phonation features of the vowel O
 * ***fon_v_U_ccas*** : numpy matrix of features and target for the phonation features of the vowel U


* **art_rt_ccas** : numpy matrix of features and target for the articulation features of the sentence.
* **art_words_cas** : dictionary holding the articulation data (numpy matrix and DataFrame) for each word
 >* ***art_w_atleta_ccas*** : numpy matrix of features and target for the articulation features of the word atleta
 * ***art_w_campana_ccas*** : numpy matrix of features and target for the articulation features of the word campana
 * ***art_w_braso_ccas*** : numpy matrix of features and target for the articulation features of the word braso
 * ***art_w_gato_ccas*** : numpy matrix of features and target for the articulation features of the word gato
 * ***art_w_petaka_ccas*** : numpy matrix of features and target for the articulation features of the word petaka


* **prs_rt_ccas** : numpy matrix of features and target for the prosody features of the sentence.

------------------

# 0. Introduction <a id="intro"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>
* The features described in other notebooks are extracted with DisVoice, the script library also discussed there.
* The notebooks *Instalación de librerías.ipynb* and *Ejemplo ejecución scripts Disvoice.ipynb* should be read first to follow this process.
* The extracted features are saved as text files in the *CaracteristicasExtraidas* directory and are also kept in variables of this notebook.
* **People with Parkinson's disease (PD) are labelled 1 and healthy people 0.**

### Folder structure for the audios
**The audios must be organised as follows for the features to be extracted correctly.**
```
.
PC-GITA
│   
└───read-text
│   │
│   └───hc
│   │    │   AVPEPUDEAC0001_readtext.wav
│   │    │    ...
│   │   
│   └───pd
│        │   AVPEPUDEA0001_readtext.wav
│        │    ...  
│
└───words
│   │
│   └───atleta
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_atleta.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_atleta.wav
│   │         │   ...    
│   └───braso
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_braso.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_braso.wav
│   │         │   ...  
│   └───campana
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_campana.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_campana.wav
│   │         │   ...  
│   └───gato
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_gato.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_gato.wav
│   │         │   ...  
│   └───petaka
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_petaka.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_petaka.wav
│   │         │   ...
│           
└───vowels
│   │
│   └───a
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_a1.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_a1.wav
│   │         │   ...    
│   └───e
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_e1.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_e1.wav
│   │         │   ...  
│   └───i
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_i1.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_i1.wav
│   │         │   ...  
│   └───o
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_o1.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_o1.wav
│   │         │   ...  
│   └───u
│   │    │   
│   │    └── hc   
│   │    │    │   AVPEPUDEAC0001_u1.wav
│   │    │    │   ...
│   │    │
│   │    └── pd
│   │         │   AVPEPUDEA0001_u1.wav
│   │         │   ... 
```
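
Before launching a 40-minute extraction it can be useful to check that the folders are in place. The following is a minimal sketch, not part of the original notebook: `missing_audio_dirs` is a hypothetical helper, and it only checks the `hc`/`pd` folders listed in the tree above, not the audio files themselves.

```python
from pathlib import Path

# Audio types expected under the dataset root, as in the tree above.
WORDS = ["atleta", "braso", "campana", "gato", "petaka"]
VOWELS = ["a", "e", "i", "o", "u"]


def missing_audio_dirs(root):
    """Return the hc/pd folders that are absent under root (hypothetical helper)."""
    root = Path(root)
    audio_types = [Path("read-text")]
    audio_types += [Path("words") / w for w in WORDS]
    audio_types += [Path("vowels") / v for v in VOWELS]
    missing = []
    for audio_type in audio_types:
        for group in ("hc", "pd"):
            folder = root / audio_type / group
            if not folder.is_dir():
                missing.append(folder.relative_to(root).as_posix())
    return missing
```

Calling `missing_audio_dirs("PC-GITA")` should return an empty list when the layout matches the tree.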


### Measures extracted for each type of audio:
**For each vowel, one set of phonation measures is extracted.**
> * **5 feature subsets** (1 per vowel).

**For the sentence, 3 different sets of measures are extracted: phonation, articulation and prosody.**
> * Sentence: *“Ayer fui al medico. Qué le pasa? Me preguntó. Yo le dije: Ay doctor! Donde pongo el dedo me duele. Tiene la una rota? Sí. Pues ya sabemos que es. Deje su cheque a la salida.”*
* **3 feature subsets.**

**For each word, 2 different sets of measures are extracted: phonation and articulation.**
> * The **N = 5** words that give the best results in the papers related to our dataset are chosen.
* Words: atleta, campana, gato, petaka, braso.
* **2 x 5 = 10 feature subsets.**
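
The counts above add up to 18 feature subsets. As a quick check, the sketch below lists them using the variable names from the glossary; it is only an illustration of the bookkeeping and extracts nothing.

```python
# Audio types used in this notebook.
vowels = ["A", "E", "I", "O", "U"]
words = ["atleta", "campana", "gato", "petaka", "braso"]

# 5 phonation subsets, one per vowel.
subsets = [f"fon_v_{v}_ccas" for v in vowels]
# 3 subsets for the sentence: phonation, articulation and prosody.
subsets += ["fon_rt_ccas", "art_rt_ccas", "prs_rt_ccas"]
# 2 x 5 subsets for the words: phonation and articulation.
subsets += [f"{m}_w_{w}_ccas" for m in ("fon", "art") for w in words]

print(len(subsets))  # 5 + 3 + 10 = 18
```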

--------

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
!mkdir CaracteristicasExtraidas

Ya existe el subdirectorio o el archivo CaracteristicasExtraidas.
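
The message above is the shell reporting that the folder already exists. A possible alternative, shown here as a sketch rather than as the notebook's own approach, is to create the folder from Python, which works the same on any operating system and stays silent when the folder is already there.

```python
import os

# exist_ok=True makes the call a no-op when the folder already exists.
os.makedirs("CaracteristicasExtraidas", exist_ok=True)
```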


### Function that appends the target to a feature matrix

In [3]:
def add_target(ccas, parkinson):
    return np.hstack((ccas,np.ones((ccas.shape[0],1)))) if parkinson else np.hstack((ccas,np.zeros((ccas.shape[0],1))))
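
To make the effect of `add_target` concrete, here is a small sketch on a made-up matrix. The function is rewritten with `np.full`, which should behave like the one-line version above; the toy values are invented for the example.

```python
import numpy as np


def add_target(ccas, parkinson):
    # Append a column of 1s (PD) or 0s (healthy) to the feature matrix.
    label = 1.0 if parkinson else 0.0
    return np.hstack((ccas, np.full((ccas.shape[0], 1), label)))


toy = np.array([[0.1, 0.2], [0.3, 0.4]])  # 2 made-up audios x 2 made-up features
labelled = add_target(toy, True)
print(labelled.shape)   # (2, 3)
print(labelled[:, -1])  # [1. 1.]
```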

### General function that extracts the features of one specific audio type
We extract the features of the healthy speakers and label them, do the same for the PD speakers, and finally concatenate both. As already mentioned, the audio types are: read-text, each vowel separately and each word separately.

It is implemented as follows:
* The path passed in is the folder of the audio type, which contains both the Parkinson's patients and the healthy controls -> folders *PC-GITA/read-text, PC-GITA/words/atleta, PC-GITA/words/braso, ..., PC-GITA/vowels/a, ..., PC-GITA/vowels/u.*
* The features of the healthy-control directory (hc) are extracted and labelled.
* The features of the PD directory (pd) are extracted and labelled.
* Both are concatenated, since they belong to the same audio type, and returned as a numpy array.

In [4]:
def extraccion_ccas_directorio(script, audios, ccashc, ccaspd):
    '''
    Returns, as a numpy array, the measures that `script` extracts from the audios in `audios`.
    For one audio type it first processes the healthy controls and labels them, then does the
    same for the PD patients. Finally it concatenates both.

    script: name of the script to run, name only.
    audios: path of the audio directory to analyse, relative to src/.
    ccashc: name of the file saved in the CaracteristicasExtraidas directory for hc.
    ccaspd: name of the file saved in the CaracteristicasExtraidas directory for pd.
    '''
    # Extract the features of the healthy controls (the command uses Windows shell syntax)
    comando = 'cd Disvoice\\'+script +' & python '+script+'.py '
    comando+= '"../../'+audios+'hc/" "../../CaracteristicasExtraidas/'+ccashc+'" "static" "false"'
    os.system(comando)
    hc = np.loadtxt("CaracteristicasExtraidas/"+ccashc)
    hc = add_target(hc, False)
    assert (hc[:, -1] == 0).all()  # check that every healthy row is labelled 0

    # Extract the features of the PD patients
    comando = 'cd Disvoice\\'+script +' & python '+script+'.py '
    comando+= '"../../'+audios+'pd/" "../../CaracteristicasExtraidas/'+ccaspd+'" "static" "false"'
    os.system(comando)
    ccas_pd = np.loadtxt("CaracteristicasExtraidas/"+ccaspd)  # not named `pd`, to avoid shadowing pandas
    ccas_pd = add_target(ccas_pd, True)
    assert (ccas_pd[:, -1] == 1).all()  # check that every PD row is labelled 1

    # Return the whole set together: healthy rows first, then PD rows
    return np.concatenate((hc, ccas_pd))
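
The command string above relies on Windows shell syntax (`cd Disvoice\...` and `&`). A more portable variant could pass the working directory to `subprocess.run` instead. The sketch below assumes the same script arguments as the cell above; `build_disvoice_call` and `run_disvoice` are hypothetical helper names, not part of DisVoice.

```python
import subprocess
from pathlib import Path


def build_disvoice_call(script, audios, group, out_name):
    """Build (args, cwd) for one DisVoice run, with the same arguments as above."""
    cwd = Path("Disvoice") / script
    args = [
        "python", script + ".py",
        "../../" + audios + group + "/",                # audio folder (hc or pd)
        "../../CaracteristicasExtraidas/" + out_name,   # output text file
        "static", "false",
    ]
    return args, cwd


def run_disvoice(script, audios, group, out_name):
    args, cwd = build_disvoice_call(script, audios, group, out_name)
    # check=True raises CalledProcessError if the script exits with an error,
    # instead of failing later when the output file is loaded.
    subprocess.run(args, cwd=cwd, check=True)
```

For example, `run_disvoice('phonation', 'PC-GITA/read-text/', 'hc', 'fon_rt_hc.txt')` would correspond to the first `os.system` call for the sentence.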

----------------

# 1. Extracting phonation measures from voiced frames <a id="fona"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

***The extracted features are:***
>1. First derivative of the fundamental frequency.
>2. Second derivative of the fundamental frequency.
>3. Jitter.
>4. Shimmer.
>5. APQ (Amplitude perturbation quotient).
>6. PPQ (Pitch perturbation quotient).
>7. Logarithmic energy.


**We return a vector of 29 features (7 features x 4 [mean, std, kurtosis and skewness] + degree of unvoiced) for each audio, plus the class label.**
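
To show where the number 29 comes from, the sketch below computes the four statistics over made-up per-frame measures and prepends one placeholder value for the degree of unvoiced. It is not DisVoice's code: the column order and the kurtosis convention (excess kurtosis here) are assumptions and may differ from the library's output.

```python
import numpy as np


def static_functionals(frames):
    """frames: (n_frames, n_features) array of per-frame measures."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    centered = (frames - mean) / std
    skewness = (centered ** 3).mean(axis=0)
    kurtosis = (centered ** 4).mean(axis=0) - 3.0  # excess kurtosis (assumed convention)
    return np.concatenate((mean, std, skewness, kurtosis))


rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 7))   # 7 made-up phonation measures per voiced frame
unvoiced_degree = np.array([12.5])   # placeholder value
vector = np.concatenate((unvoiced_degree, static_functionals(frames)))
print(vector.shape)  # (29,)
```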

### 1.1 Extracting phonation measures for the sentence <a id="fonfra"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>
There is a single sentence to extract features from. We extract the features of PD patients and healthy controls separately, label them and concatenate them into the same numpy array.

In [5]:
def phonation_readtext_extraction():
    '''
    Calls the feature-extraction function with the required paths
    '''
    return extraccion_ccas_directorio('phonation', 'PC-GITA/read-text/', 'fon_rt_hc.txt' , 'fon_rt_pd.txt' )

In [6]:
#Extract the features
fon_rt_ccas = phonation_readtext_extraction()

In [7]:
#Check that the 29 features plus the target (30 columns) were extracted for the 100 audios
assert fon_rt_ccas.shape == (100,30)
fon_rt_ccas.shape

(100, 30)

In [8]:
#create a DataFrame to visualise the data
name_rt = os.listdir('PC-GITA/read-text/hc')+os.listdir('PC-GITA/read-text/pd')
df_fon_rt = pd.DataFrame(fon_rt_ccas,index=name_rt)
df_fon_rt.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001_readtext.wav,45.58992,-0.180555,-0.007559,3.382826,7.064785,31.994413,5.162992,-19.549818,27.064541,38.878663,...,3.58615,-0.654771,20.360747,13.429554,21.497042,15.870199,2.270639,20.074127,2.588947,0.0
AVPEPUDEAC0003_readtext.wav,55.830389,-0.032528,-0.002346,2.909123,5.788144,28.884041,3.382407,-21.318398,9.640714,14.227814,...,5.120765,-0.489256,25.950614,25.277762,31.146949,21.858915,2.987013,39.323302,3.130708,0.0
AVPEPUDEAC0004_readtext.wav,47.713951,0.352271,0.029832,2.280781,7.223241,30.927178,3.389033,-19.264654,18.425076,25.224171,...,4.595876,-1.055237,38.567563,21.366797,45.712238,12.475809,2.599666,28.215186,3.625223,0.0
AVPEPUDEAC0005_readtext.wav,39.330544,-0.086,-0.002401,2.625822,9.195097,31.287548,2.736765,-16.843645,11.12984,15.518005,...,3.218766,-0.810821,24.18209,36.701796,33.630067,10.816464,2.69971,17.193896,3.238379,0.0
AVPEPUDEAC0006_readtext.wav,46.006749,-0.082522,0.001371,2.266958,7.164403,36.94138,2.644868,-20.89818,15.581972,20.953881,...,4.613178,-0.429564,18.574532,12.463404,22.025656,20.825562,3.946787,33.697596,2.482541,0.0


In [9]:
#Save the features in numpy format
np.save('CaracteristicasExtraidas/fon_rt_ccas',fon_rt_ccas)
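
A later notebook would typically reload the saved matrix and separate the features from the target column. The sketch below illustrates that round trip with a small made-up matrix and a temporary folder, so the file name and values are placeholders rather than the notebook's data.

```python
import os
import tempfile
import numpy as np

# Stand-in for fon_rt_ccas: 4 made-up audios, 3 features + target column.
demo = np.array([[0.1, 0.2, 0.3, 0.0],
                 [0.4, 0.5, 0.6, 0.0],
                 [0.7, 0.8, 0.9, 1.0],
                 [1.0, 1.1, 1.2, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_ccas")
    np.save(path, demo)              # np.save appends the .npy extension
    loaded = np.load(path + ".npy")

# The last column is the label (1 = PD, 0 = healthy); the rest are features.
X, y = loaded[:, :-1], loaded[:, -1].astype(int)
print(X.shape, y)  # (4, 3) [0 0 1 1]
```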

--------

### 1.2 Extracting phonation measures for the words <a id="fonpal"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>
One subset is extracted for each of the best-performing words.

**The function iterates over the words whose features we want to extract. For each one it obtains the feature matrix as a numpy array and as a pandas DataFrame. It returns a dictionary keyed by word. Each value is another dictionary whose keys, `numpy` and `dataframe`, select the format in which we want the data**.

To do so we reuse the function defined above, which extracts the features of one specific audio type. This function receives only the words, iterates over them, and calls the previous function with the full directory path of each word.

`fon_words_ccas = { 'atleta': {'numpy': [[1,2..],[3,2..]], 'dataframe': pd.DF },
                    'petaka': {'numpy': [[7,5..],[4,9..]], 'dataframe': pd.DF }                                                              }`

In [10]:
def phonation_word_extraction(palabras):
    '''
    Calls the feature-extraction function with the required paths for each word
    '''
    ccas_palabras = dict()
    for p in palabras:
        ccas_palabras[p] = dict()
        ccas_palabras[p]['numpy'] = extraccion_ccas_directorio('phonation', 'PC-GITA/words/'+p+'/', 'fon_w_'+p+'_hc.txt' , 'fon_w_'+p+'_pd.txt' )
        names= os.listdir('PC-GITA/words/'+p+'/hc')+os.listdir('PC-GITA/words/'+p+'/pd')
        ccas_palabras[p]['dataframe'] = pd.DataFrame(ccas_palabras[p]['numpy'],index=names)
        print('Palabras analizadas: ',ccas_palabras.keys())
    return ccas_palabras

In [11]:
words=['atleta','campana','gato','petaka','braso']
fon_words_ccas = phonation_word_extraction(words)

Palabras analizadas:  dict_keys(['atleta'])
Palabras analizadas:  dict_keys(['atleta', 'campana'])
Palabras analizadas:  dict_keys(['atleta', 'campana', 'gato'])
Palabras analizadas:  dict_keys(['atleta', 'campana', 'gato', 'petaka'])
Palabras analizadas:  dict_keys(['atleta', 'campana', 'gato', 'petaka', 'braso'])


In [12]:
fon_words_ccas.keys()

dict_keys(['atleta', 'campana', 'gato', 'petaka', 'braso'])

##### 1.2.1 Atleta

In [13]:
fon_w_atleta_ccas = fon_words_ccas['atleta']['numpy']

In [14]:
assert fon_w_atleta_ccas.shape == (100,30)
fon_w_atleta_ccas.shape

(100, 30)

In [15]:
df_fon_w_atleta  = fon_words_ccas['atleta']['dataframe']
df_fon_w_atleta.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001atleta.wav,37.209302,0.263179,0.757631,6.241905,11.653086,49.109118,4.486473,-20.74435,33.490068,50.780001,...,2.754526,-0.117091,8.050549,11.001079,7.29146,7.099865,2.226755,10.321506,1.790678,0.0
AVPEPUDEAC0003atleta.wav,42.5,-0.43757,-0.523837,3.717318,16.115367,62.466374,4.040991,-20.223288,9.613317,13.195052,...,1.142504,-0.544338,11.784471,6.916095,12.561063,3.302717,1.617704,3.103826,2.389755,0.0
AVPEPUDEAC0004atleta.wav,36.842105,-0.413006,-0.631851,6.100776,15.817663,54.660247,10.105509,-18.240561,35.115596,60.091766,...,2.820359,-0.961777,13.318142,10.213105,13.069356,4.323537,2.724788,10.44166,2.898859,0.0
AVPEPUDEAC0005atleta.wav,41.463415,-0.342675,-0.500265,3.528358,14.981613,42.30549,3.020648,-17.72528,10.897566,13.744397,...,1.838674,-1.127669,8.690101,8.153002,11.326022,4.030117,2.707122,5.277319,3.035835,0.0
AVPEPUDEAC0006atleta.wav,31.25,-5.159983,-0.030316,3.470258,16.730781,54.581587,1.488015,-19.564387,21.547273,31.600701,...,1.041484,-0.36227,15.348422,8.002548,16.74963,3.103318,1.860544,2.788721,1.701754,0.0


In [16]:
np.save('CaracteristicasExtraidas/fon_w_atleta_ccas',fon_w_atleta_ccas)

##### 1.2.2 Campana

In [17]:
fon_w_campana_ccas = fon_words_ccas['campana']['numpy']

In [18]:
assert fon_w_campana_ccas.shape == (100,30)
fon_w_campana_ccas.shape

(100, 30)

In [19]:
df_fon_w_campana = fon_words_ccas['campana']['dataframe']
df_fon_w_campana.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001campana.wav,14.634146,1.557845,1.450032,2.699567,8.449815,42.041152,2.082653,-20.56257,13.735793,17.297768,...,2.7533,-0.010711,10.876199,9.80164,13.880779,11.37704,3.478157,9.596572,2.058204,0.0
AVPEPUDEAC0003campana.wav,27.272727,-0.565581,0.235706,3.856959,8.403155,36.969943,3.606387,-18.661289,11.953691,19.470171,...,3.146372,-1.133168,17.29313,13.758207,18.285291,19.419755,2.377221,13.414565,3.752878,0.0
AVPEPUDEAC0004campana.wav,26.829268,-1.058662,-0.201209,3.457628,9.000631,34.657313,3.435484,-16.963725,28.948691,49.347424,...,3.177224,-0.815346,15.596879,14.087266,14.702856,13.778097,2.774502,13.064172,2.574991,0.0
AVPEPUDEAC0005campana.wav,31.818182,-1.095331,0.060272,3.314671,9.042618,43.707104,3.460733,-17.544373,14.001238,19.423711,...,2.055786,-0.624777,18.392428,11.54375,19.148304,12.520071,4.250439,6.122724,3.094152,0.0
AVPEPUDEAC0006campana.wav,21.875,4.820747,0.018058,3.579771,10.745706,43.734666,4.389756,-19.32032,22.330106,33.771086,...,2.568851,-0.426564,16.734657,9.485948,18.635765,7.844283,1.923068,8.832825,2.022774,0.0


In [20]:
np.save('CaracteristicasExtraidas/fon_w_campana_ccas',fon_w_campana_ccas)

##### 1.2.3 Gato

In [21]:
fon_w_gato_ccas = fon_words_ccas['gato']['numpy']

In [22]:
df_fon_w_gato = fon_words_ccas['gato']['dataframe']  # keep the full DataFrame, not only its head
df_fon_w_gato.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001gato.wav,34.375,-0.912906,0.177372,6.708612,9.85318,71.0183,12.226551,-23.000618,35.162057,47.177981,...,0.647977,0.467576,7.529346,3.948183,6.715125,5.835488,1.911439,1.794171,2.267345,0.0
AVPEPUDEAC0003gato.wav,55.172414,-17.321974,-2.616479,7.139592,14.63191,49.813333,10.566397,-21.059729,59.001519,96.600622,...,1.236424,-0.269174,9.32955,5.392853,10.441583,4.065314,0.0,3.435396,1.441643,0.0
AVPEPUDEAC0004gato.wav,48.275862,-3.700018,-0.800627,3.59562,11.502367,38.244115,4.346517,-19.288531,16.114886,24.870995,...,1.09952,-0.222756,9.025506,5.984549,11.363206,2.454031,1.5,2.547124,1.861459,0.0
AVPEPUDEAC0005gato.wav,45.16129,-2.756495,-1.500746,4.87399,11.1922,54.9324,3.819957,-18.262836,16.090507,18.518783,...,1.707542,-0.037807,8.064149,7.254751,11.1738,9.407455,2.04662,5.25012,1.593889,0.0
AVPEPUDEAC0006gato.wav,33.333333,-2.496852,-0.530901,6.284578,13.271619,55.611344,11.945208,-21.134894,33.261049,45.436813,...,0.653151,-0.355216,8.473651,5.116294,8.389119,2.306112,1.543276,1.593581,2.162482,0.0


In [23]:
np.save('CaracteristicasExtraidas/fon_w_gato_ccas',fon_w_gato_ccas)

##### 1.2.4 Petaka

In [24]:
fon_w_petaka_ccas = fon_words_ccas['petaka']['numpy']

In [25]:
df_fon_w_petaka = fon_words_ccas['petaka']['dataframe']  # keep the full DataFrame, not only its head
df_fon_w_petaka.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001petaka.wav,39.534884,-5.436706,0.087709,4.611277,10.830924,66.71046,8.104098,-22.70104,30.59932,43.259772,...,1.098111,0.050257,9.451944,5.124114,7.841727,8.590627,2.595252,3.138978,1.519287,0.0
AVPEPUDEAC0003petaka.wav,58.333333,-0.9505,-0.382478,2.905243,21.38449,77.313733,0.819345,-18.621882,6.317318,7.721874,...,1.234768,-1.049734,8.739768,4.671769,10.408987,4.5804,1.5,3.596851,2.91658,0.0
AVPEPUDEAC0004petaka.wav,48.780488,-1.665391,-0.467323,2.921456,14.010929,55.168863,3.289853,-19.083201,12.759047,16.962451,...,0.375414,-0.303916,7.571001,3.828192,7.327211,12.673284,1.826432,1.770381,1.982356,0.0
AVPEPUDEAC0005petaka.wav,57.894737,-2.621687,-0.381915,6.336473,16.109262,32.277524,8.000591,-14.345954,25.072708,37.784108,...,0.434209,-0.885221,6.977081,3.403636,6.060849,4.667633,1.681504,1.729807,3.209285,0.0
AVPEPUDEAC0006petaka.wav,28.947368,-1.679855,-0.025326,1.539837,11.334187,62.267051,1.850957,-21.158786,7.811683,11.536216,...,0.981785,0.085687,10.655733,5.388514,10.535,14.540899,2.174295,2.766974,1.629073,0.0


In [26]:
np.save('CaracteristicasExtraidas/fon_w_petaka_ccas',fon_w_petaka_ccas)

##### 1.2.5 Braso

In [27]:
fon_w_braso_ccas = fon_words_ccas['braso']['numpy']

In [28]:
df_fon_w_braso = fon_words_ccas['braso']['dataframe']  # keep the full DataFrame, not only its head
df_fon_w_braso.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001braso.wav,25.714286,3.343057,0.487152,4.424879,8.117806,33.274852,4.17534,-22.869925,23.841214,34.009409,...,2.999256,0.084013,14.808166,10.522175,18.448042,3.375617,2.275681,11.254537,1.776811,0.0
AVPEPUDEAC0003braso.wav,40.625,-1.085594,1.253372,7.2005,11.406784,34.275261,7.087388,-19.397205,18.493539,26.530073,...,1.074228,-0.362152,5.081131,3.363732,4.673962,6.759178,2.186137,3.206795,1.412197,0.0
AVPEPUDEAC0004braso.wav,29.032258,-2.093897,-0.181963,2.264396,10.287819,50.278285,1.812007,-17.354383,8.693359,9.786969,...,1.066565,-0.850381,5.241551,3.89479,6.006356,4.769541,2.079927,2.895359,2.364365,0.0
AVPEPUDEAC0005braso.wav,36.111111,-0.606649,-0.22574,5.10612,8.745526,18.54958,5.331556,-16.646135,16.371756,26.237283,...,2.284294,-0.950251,7.691936,8.306096,8.694405,6.356049,2.249476,8.299414,2.816978,0.0
AVPEPUDEAC0006braso.wav,17.647059,-1.27952,0.073474,2.004974,7.682169,31.918942,1.944384,-18.957871,9.725343,14.263869,...,1.861142,-0.101397,13.207212,8.499125,15.751931,6.338592,1.922388,5.322566,2.215963,0.0


In [29]:
np.save('CaracteristicasExtraidas/fon_w_braso_ccas',fon_w_braso_ccas)
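
The five word subsections repeat the same three steps: pick the matrix, check its shape, save it. They could also be written as one loop over the dictionary. The sketch below uses a toy dictionary and a temporary folder as stand-ins, so it runs without the dataset.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for fon_words_ccas (the real one also holds a 'dataframe' entry per word).
toy_words_ccas = {w: {"numpy": np.zeros((2, 30))} for w in ("atleta", "gato")}

out_dir = tempfile.mkdtemp()  # in the notebook this would be 'CaracteristicasExtraidas'
for word, formats in toy_words_ccas.items():
    matrix = formats["numpy"]
    assert matrix.shape[1] == 30  # 29 features + target
    np.save(os.path.join(out_dir, f"fon_w_{word}_ccas"), matrix)

print(sorted(os.listdir(out_dir)))  # ['fon_w_atleta_ccas.npy', 'fon_w_gato_ccas.npy']
```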

----------

### 1.3 Extracting phonation measures for the vowels <a id="fonvoc"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>
One subset is extracted for each vowel.

**The function iterates over the vowels whose features we want to extract. For each one it obtains the feature matrix as a numpy array and as a pandas DataFrame. It returns a dictionary keyed by vowel. Each value is another dictionary whose keys, `numpy` and `dataframe`, select the format in which we want the data**.

To do so we reuse the function defined above, which extracts the features of one specific audio type. This function receives only the vowels, iterates over them, and calls the previous function with the full directory path of each vowel.

`fon_vowels_ccas = { 'a': {'numpy': [[1,2..],[3,2..]], 'dataframe': pd.DF },
                    'e': {'numpy': [[7,5..],[4,9..]], 'dataframe': pd.DF }                                                              }`

In [30]:
def phonation_vowel_extraction(vowels):
    '''
    Calls the feature-extraction function with the required paths for each vowel
    '''
    ccas_vocales = dict()
    for v in vowels:
        ccas_vocales[v] = dict()
        ccas_vocales[v]['numpy'] = extraccion_ccas_directorio('phonation', 'PC-GITA/vowels/'+v+'/', 'fon_v_'+v+'_hc.txt' , 'fon_v_'+v+'_pd.txt' )
        names= os.listdir('PC-GITA/vowels/'+v+'/hc')+os.listdir('PC-GITA/vowels/'+v+'/pd')
        ccas_vocales[v]['dataframe'] = pd.DataFrame(ccas_vocales[v]['numpy'],index=names)
        print('Vocales analizadas: ',ccas_vocales.keys())
    return ccas_vocales

In [31]:
fon_vowels_ccas = phonation_vowel_extraction(['A','E','I','O','U'])

Vocales analizadas:  dict_keys(['A'])
Vocales analizadas:  dict_keys(['A', 'E'])
Vocales analizadas:  dict_keys(['A', 'E', 'I'])
Vocales analizadas:  dict_keys(['A', 'E', 'I', 'O'])
Vocales analizadas:  dict_keys(['A', 'E', 'I', 'O', 'U'])


##### 1.3.1 A

In [32]:
fon_v_A_ccas = fon_vowels_ccas['A']['numpy']

In [33]:
fon_v_A_ccas.shape #300,30

(300, 30)

In [34]:
df_fon_v_A = fon_vowels_ccas['A']['dataframe']
df_fon_v_A.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001a1.wav,2.985075,0.387348,-0.070557,0.381387,2.114075,3.179037,0.193298,-11.663632,1.267161,0.824763,...,1.659987,-1.77314,5.982853,2.585707,8.613248,4.095289,2.563482,7.113523,6.514427,0.0
AVPEPUDEAC0001a2.wav,8.0,-1.593897,-0.093742,1.716067,3.09167,5.563952,2.188715,-13.007131,17.176457,24.598319,...,3.222304,-1.597765,41.690144,21.607404,42.964778,5.903705,2.079427,12.163514,5.065292,0.0
AVPEPUDEAC0001a3.wav,3.846154,1.270619,0.045786,0.626601,4.373655,6.601985,0.165308,-12.206003,1.720442,1.042476,...,2.471433,-1.358714,3.579401,7.332171,4.607747,4.453881,1.724391,11.161962,3.449415,0.0
AVPEPUDEAC0003a1.wav,1.265823,0.00019,0.006351,0.954605,1.893255,5.48277,0.717664,-17.885359,1.246632,1.38376,...,1.765245,0.2953,3.430727,3.805707,4.537526,6.289391,4.409896,7.911378,2.014197,0.0
AVPEPUDEAC0003a2.wav,4.0,0.387974,-0.082857,0.701856,3.415714,8.264762,0.373591,-13.832545,1.195379,0.764615,...,1.412863,0.649826,5.769379,4.097011,8.617175,4.257346,2.299671,4.67557,1.836906,0.0
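
The index above shows several recordings per speaker (`a1`, `a2`, `a3`), which is consistent with the vowel matrices having 300 rows instead of 100. As an illustration, the sketch below counts the recordings per speaker from a few of the file names shown; the regular expression is an assumption about the naming pattern, not something defined by the dataset.

```python
import re
from collections import Counter

# File names copied from the DataFrame index shown in this notebook.
names = ["AVPEPUDEAC0001a1.wav", "AVPEPUDEAC0001a2.wav", "AVPEPUDEAC0001a3.wav",
         "AVPEPUDEAC0003a1.wav", "AVPEPUDEAC0003a2.wav"]

# speaker id, optional underscore, task name, optional repetition number
pattern = re.compile(r"^(?P<speaker>[A-Z]+\d+)_?(?P<task>[a-z]+?)(?P<rep>\d*)\.wav$")

per_speaker = Counter(pattern.match(n).group("speaker") for n in names)
print(per_speaker)
```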


In [35]:
np.save('CaracteristicasExtraidas/fon_v_A_ccas',fon_v_A_ccas)

##### 1.3.2 E

In [36]:
fon_v_E_ccas = fon_vowels_ccas['E']['numpy']

In [37]:
fon_v_E_ccas.shape

(300, 30)

In [38]:
df_fon_v_E = fon_vowels_ccas['E']['dataframe']
df_fon_v_E.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001e1.wav,3.448276,0.488747,0.054217,0.351233,3.694646,4.704833,0.156898,-7.869,1.04861,0.969487,...,1.271771,-2.024045,4.267277,5.124661,6.324998,9.228551,5.676948,3.670062,8.158257,0.0
AVPEPUDEAC0001e2.wav,7.272727,0.594295,0.125128,0.446609,2.899152,5.263382,0.191244,-10.759785,1.413129,1.301634,...,0.632146,0.036877,11.999319,15.522694,23.077896,3.699544,2.556864,2.280667,1.863171,0.0
AVPEPUDEAC0001e3.wav,7.843137,0.404945,-0.001345,0.372319,3.700708,4.75749,0.174784,-7.977883,1.068285,1.029418,...,1.516486,-0.841117,3.984804,3.720448,3.058141,4.178861,3.34722,4.907642,3.873718,0.0
AVPEPUDEAC0003e1.wav,25.324675,0.484714,-0.04333,0.670662,1.937259,5.633865,0.387582,-16.719993,3.870494,5.407784,...,7.539372,0.957572,96.945621,54.131769,103.110404,14.617839,7.88897,70.005889,4.176245,0.0
AVPEPUDEAC0003e2.wav,49.673203,0.034788,0.002764,1.656038,3.397673,9.569013,2.190465,-14.970104,7.844919,10.975147,...,2.372477,0.851977,30.656743,15.53387,30.697626,3.871824,2.371621,7.974402,2.762013,0.0


In [39]:
np.save('CaracteristicasExtraidas/fon_v_E_ccas',fon_v_E_ccas)

##### 1.3.3 I

In [40]:
fon_v_I_ccas = fon_vowels_ccas['I']['numpy']

In [41]:
fon_v_I_ccas.shape

(300, 30)

In [42]:
df_fon_v_I = fon_vowels_ccas['I']['dataframe']
df_fon_v_I.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001i1.wav,7.54717,2.998905,0.00683,1.297058,3.571238,6.702213,1.891581,-7.828213,15.625998,22.51093,...,3.642067,-1.075105,45.478971,23.394703,46.656886,6.765462,3.897816,14.80239,3.855432,0.0
AVPEPUDEAC0001i2.wav,4.545455,0.715651,0.132883,0.435609,3.077857,5.390932,0.21841,-7.34176,1.379148,1.300752,...,3.415251,-0.939878,7.450114,7.785121,9.287578,2.782504,1.910833,17.327323,3.718781,0.0
AVPEPUDEAC0001i3.wav,9.302326,0.776486,-0.018152,0.442105,3.707595,6.325244,0.192885,-7.966394,0.947265,1.007533,...,1.27878,-1.501569,2.581623,3.371044,2.00131,4.919458,4.960079,4.925254,6.073972,0.0
AVPEPUDEAC0003i1.wav,31.288344,0.152931,-0.049941,1.707685,2.35837,5.662404,2.64834,-13.432639,23.851342,34.361284,...,3.816159,-0.153737,53.122006,25.88967,52.497495,17.960409,3.6428,16.881619,1.570082,0.0
AVPEPUDEAC0003i2.wav,64.321608,0.850295,-0.061087,1.899609,3.102822,5.556011,2.094865,-12.117389,7.24611,10.953665,...,2.700817,-0.124865,21.586647,11.121414,24.221042,5.826794,5.225726,10.190176,2.095695,0.0


In [43]:
np.save('CaracteristicasExtraidas/fon_v_I_ccas',fon_v_I_ccas)

##### 1.3.4 O

In [44]:
fon_v_O_ccas = fon_vowels_ccas['O']['numpy']

In [45]:
fon_v_O_ccas.shape

(300, 30)

In [46]:
df_fon_v_O = fon_vowels_ccas['O']['dataframe']
df_fon_v_O.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001o1.wav,8.163265,0.441181,-0.24643,0.39906,3.201527,3.656395,0.142608,-7.222527,1.475071,0.985585,...,0.199136,-1.646611,10.973119,6.048119,9.445613,8.928874,7.27788,2.218545,6.859044,0.0
AVPEPUDEAC0001o2.wav,5.405405,0.837917,0.027773,0.538214,4.627783,7.570859,0.244312,-12.580003,1.54514,1.255048,...,0.745277,-0.099574,4.435587,3.791638,4.715407,4.406013,2.678295,2.680962,1.872471,0.0
AVPEPUDEAC0001o3.wav,4.761905,1.292839,0.433829,0.682458,4.762853,8.742055,0.270752,-10.766899,2.737106,2.421846,...,2.901494,-1.149622,8.782392,12.554612,9.368785,3.445836,4.335708,13.2864,3.117525,0.0
AVPEPUDEAC0003o1.wav,18.719212,0.111727,-0.01496,0.877343,1.647125,5.370207,0.915111,-20.398424,4.106959,5.290709,...,4.633961,1.329516,70.609484,39.918207,76.070018,18.136437,5.380536,25.560423,3.611659,0.0
AVPEPUDEAC0003o2.wav,62.032086,0.523141,-0.046233,1.964475,2.453327,6.201136,1.864457,-16.394372,9.382762,15.46407,...,3.886752,0.575811,26.682165,24.37625,25.488767,5.069681,7.966983,19.569812,1.770345,0.0


In [47]:
np.save('CaracteristicasExtraidas/fon_v_O_ccas',fon_v_O_ccas)

##### 1.3.5 U

In [48]:
fon_v_U_ccas = fon_vowels_ccas['U']['numpy']

In [49]:
fon_v_U_ccas.shape

(300, 30)

In [50]:
df_fon_v_U = fon_vowels_ccas['U']['dataframe']
df_fon_v_U.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
AVPEPUDEAC0001u1.wav,8.510638,-0.000199,-0.090555,0.58219,3.986624,7.407212,0.23554,-9.22909,2.074067,1.441283,...,2.853415,0.299099,6.263822,6.33889,8.109899,11.591994,2.31811,11.724652,2.319731,0.0
AVPEPUDEAC0001u2.wav,5.714286,0.037247,-0.377785,0.723754,5.426768,10.090378,0.265044,-7.618944,2.750681,1.644609,...,1.17751,0.321345,8.748865,3.031298,11.468794,3.329694,2.838693,4.208642,2.276656,0.0
AVPEPUDEAC0001u3.wav,2.5,0.297518,-0.125338,0.367888,2.842377,5.363515,0.147112,-9.798947,1.652143,0.78577,...,2.671553,-0.838342,9.933459,6.514393,12.896475,7.542959,5.661478,12.560645,4.902115,0.0
AVPEPUDEAC0003u1.wav,43.147208,0.107614,-0.039153,1.103996,1.592935,4.145098,0.936207,-14.075732,4.411092,6.200884,...,3.463063,0.45024,32.878962,20.891695,34.130603,13.237685,5.656461,15.21275,2.463326,0.0
AVPEPUDEAC0003u2.wav,38.576779,-0.019657,0.139525,1.790683,1.70164,3.26029,3.302586,-12.082711,21.820362,31.293258,...,3.930689,-0.491652,63.818441,31.53802,63.92691,12.915591,12.398458,20.164465,2.327022,0.0


In [51]:
np.save('CaracteristicasExtraidas/fon_v_U_ccas',fon_v_U_ccas)

-------------------
# 2. Extracting articulation measures from transitions <a id="arti"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

***There are 122 extracted features in total. In summary:***
>* 1-22  : the 22 **BBE** (Bark band energies) in onset transitions.
>* 23-58  : the 12 **MFCC** in onset transitions (values, first and second derivatives).
>* 59-80  : the 22 **BBE** in offset transitions.
>* 81-116 : the 12 **MFCC** in offset transitions (values, first and second derivatives).
>* 117-122: **first and second formant frequencies** (values, first and second derivatives).

**We return a vector of 488 features (122 features x 4 [mean, std, kurtosis and skewness]) for each audio, plus the class label.**
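
The totals can be checked directly from the index ranges listed above. The sketch below is only this arithmetic; the group names are informal labels chosen for the example.

```python
# (first, last) 1-based indices of each descriptor group, taken from the list above.
groups = {
    "BBE onset": (1, 22),
    "MFCC onset (+ derivatives)": (23, 58),
    "BBE offset": (59, 80),
    "MFCC offset (+ derivatives)": (81, 116),
    "F1/F2 (+ derivatives)": (117, 122),
}

sizes = {name: last - first + 1 for name, (first, last) in groups.items()}
n_descriptors = sum(sizes.values())
n_static = n_descriptors * 4  # mean, std, kurtosis, skewness

print(sizes)
print(n_descriptors, n_static, n_static + 1)  # 122 488 489
```

The last number, 489, matches the column count checked for `art_rt_ccas` once the target is appended.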

### 2.1 Extracting articulation measures for the sentence <a id="artifra"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

In [52]:
def articulation_readtext_extraction():
    return extraccion_ccas_directorio('articulation', 'PC-GITA/read-text/', 'art_rt_hc.txt' , 'art_rt_pd.txt' )

In [53]:
art_rt_ccas = articulation_readtext_extraction()

In [54]:
assert art_rt_ccas.shape == (100,489)
art_rt_ccas.shape

(100, 489)

In [55]:
# Create a DataFrame for visualization
df_art_rt = pd.DataFrame(art_rt_ccas,index=name_rt)
df_art_rt.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001_readtext.wav,-2.127603,-1.400161,-1.105309,-1.429709,-2.052056,-2.645392,-3.093076,-3.246626,-3.623741,-3.834461,...,5.040503,6.470057,4.471047,2.98543,8.375337,7.774559,3.042709,7.967512,6.900078,0.0
AVPEPUDEAC0003_readtext.wav,-1.807968,-1.044077,-0.549363,-0.684349,-1.150073,-1.823092,-2.394385,-2.590278,-3.027783,-3.066149,...,3.515764,2.831697,5.964165,5.98642,8.84718,8.370068,3.936491,7.596534,6.741154,0.0
AVPEPUDEAC0004_readtext.wav,-2.143417,-1.403522,-1.339602,-1.365834,-1.463873,-2.005851,-2.866167,-3.24012,-3.465573,-3.803078,...,10.526255,6.458614,4.454854,8.038107,13.216971,12.58005,3.042227,9.085446,7.721547,0.0
AVPEPUDEAC0005_readtext.wav,-1.762943,-1.159295,-1.302682,-1.225428,-1.638826,-2.470203,-2.923691,-3.395639,-3.670058,-3.885788,...,3.395496,5.939081,7.928204,17.620274,25.213209,24.409557,2.830009,9.303097,9.797578,0.0
AVPEPUDEAC0006_readtext.wav,-2.649578,-2.028896,-1.829117,-2.295398,-2.777043,-3.14319,-3.58947,-3.842716,-4.018132,-4.210443,...,4.449834,6.189741,5.300837,4.023716,9.582126,7.267149,3.424262,6.79744,6.103247,0.0


In [56]:
np.save('CaracteristicasExtraidas/art_rt_ccas',art_rt_ccas)

### 2.2 Extracting articulation measures for the words <a id="artipal"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

A subset is extracted for each of the best-performing words.

**The function iterates over the words whose features we want to extract. For each word it extracts the feature matrix both as a numpy array and as a pandas DataFrame. It returns a dictionary keyed by word; each value is another dictionary whose key is `numpy` or `dataframe`, depending on the format in which we want the data.**

`art_words_ccas = { 'atleta': {'numpy': [[1,2..],[3,2..]], 'dataframe': pd.DF },
                    'petaka': {'numpy': [[7,5..],[4,9..]], 'dataframe': pd.DF } }`

In [57]:
def articulation_word_extraction(palabras):
    '''
    Call the feature extraction function with the required paths for each word.
    Returns {word: {'numpy': ndarray, 'dataframe': DataFrame}}.
    '''
    ccas_palabras = dict()
    for p in palabras:
        ccas_palabras[p] = dict()
        ccas_palabras[p]['numpy'] = extraccion_ccas_directorio('articulation', 'PC-GITA/words/'+p+'/', 'art_w_'+p+'_hc.txt' , 'art_w_'+p+'_pd.txt' )
        # Row labels: file names of the hc folder followed by those of the pd folder
        names= os.listdir('PC-GITA/words/'+p+'/hc')+os.listdir('PC-GITA/words/'+p+'/pd')
        ccas_palabras[p]['dataframe'] = pd.DataFrame(ccas_palabras[p]['numpy'],index=names)
        print('Words analyzed: ',ccas_palabras.keys())
    return ccas_palabras

In [58]:
art_words_ccas = articulation_word_extraction(words)

Words analyzed:  dict_keys(['atleta'])
Words analyzed:  dict_keys(['atleta', 'campana'])
Words analyzed:  dict_keys(['atleta', 'campana', 'gato'])
Words analyzed:  dict_keys(['atleta', 'campana', 'gato', 'petaka'])
Words analyzed:  dict_keys(['atleta', 'campana', 'gato', 'petaka', 'braso'])


In [59]:
art_words_ccas.keys()

dict_keys(['atleta', 'campana', 'gato', 'petaka', 'braso'])
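
The per-word cells that follow repeat the same steps for each word: select the numpy matrix, inspect it, and save it. A minimal sketch of the saving step written as a loop is shown here, using a small synthetic dictionary and a temporary folder so that it runs on its own. `guardar_palabras` is a hypothetical helper name that does not appear in the notebook, and the synthetic dictionary omits the `dataframe` entry for brevity.

```python
import os
import tempfile
import numpy as np

def guardar_palabras(ccas_palabras, carpeta, prefijo='art_w_'):
    """Save the numpy matrix of every word as <prefijo><word>_ccas.npy."""
    rutas = []
    for palabra, datos in ccas_palabras.items():
        ruta = os.path.join(carpeta, prefijo + palabra + '_ccas.npy')
        np.save(ruta, datos['numpy'])
        rutas.append(ruta)
    return rutas

# Synthetic stand-in with the same nested layout as art_words_ccas
demo = {palabra: {'numpy': np.zeros((4, 489))} for palabra in ['atleta', 'petaka']}

carpeta = tempfile.mkdtemp()   # temporary folder instead of CaracteristicasExtraidas
rutas = guardar_palabras(demo, carpeta)
print([os.path.basename(r) for r in rutas])
```

Keeping the explicit per-word cells, as the notebook does, has the advantage that each DataFrame preview stays visible next to the file it produced.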

#### 2.2.1 Atleta

In [60]:
art_w_atleta_ccas = art_words_ccas['atleta']['numpy']

In [61]:
art_w_atleta_ccas.shape #100,489

(100, 489)

In [62]:
df_art_w_atleta = art_words_ccas['atleta']['dataframe']
df_art_w_atleta.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001atleta.wav,-1.758446,-0.836345,-0.818217,-1.038274,-1.581532,-1.959414,-2.097798,-2.140041,-2.557434,-2.867824,...,5.278505,2.32172,2.813723,14.818117,13.519394,12.903929,3.230332,13.438401,7.672462,0.0
AVPEPUDEAC0003atleta.wav,-0.856915,-0.150238,-0.254811,-0.204328,-0.653231,-0.859401,-1.086747,-1.159226,-1.895511,-2.104082,...,3.12672,2.706656,4.258966,3.997719,6.100805,5.167973,2.252313,8.321375,4.29206,0.0
AVPEPUDEAC0004atleta.wav,-1.796297,-0.847688,-0.593225,-0.797304,-0.821726,-1.238053,-2.214205,-2.269652,-2.697795,-2.962419,...,1.77131,1.963977,1.924449,7.001996,13.814677,10.108188,2.184575,17.059693,9.134573,0.0
AVPEPUDEAC0005atleta.wav,-2.487941,-1.88414,-2.248144,-1.712956,-2.00203,-3.321973,-4.038867,-4.265101,-4.771604,-5.105755,...,3.279918,2.511234,3.336133,1.964502,11.397696,4.999936,3.254491,9.824213,7.839371,0.0
AVPEPUDEAC0006atleta.wav,-1.101835,0.026807,0.04626,-0.041983,-0.076307,-0.405087,-0.539632,-1.127519,-1.307959,-1.061155,...,1.369858,2.972917,2.127315,2.76799,15.222408,9.57553,2.172555,12.783057,7.499528,0.0


In [63]:
np.save('CaracteristicasExtraidas/art_w_atleta_ccas',art_w_atleta_ccas)

#### 2.2.2 Campana

In [64]:
art_w_campana_ccas = art_words_ccas['campana']['numpy']

In [65]:
art_w_campana_ccas.shape #100,489

(100, 489)

In [66]:
df_art_w_campana = art_words_ccas['campana']['dataframe']
df_art_w_campana.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001campana.wav,-1.567073,-0.681148,-0.717816,-1.578237,-2.826475,-3.557363,-3.628994,-3.464867,-3.688759,-4.225435,...,1.5,1.5,1.5,2.210811,13.205211,9.940126,4.583486,10.488569,6.443442,0.0
AVPEPUDEAC0003campana.wav,-2.139544,-1.889852,-1.372378,-1.794797,-2.411415,-2.774713,-3.16947,-3.135621,-3.726809,-3.760447,...,1.714063,2.257439,1.79526,3.458101,4.501891,3.777923,1.715225,8.761261,4.402421,0.0
AVPEPUDEAC0004campana.wav,-1.724386,-1.311269,-1.649125,-2.460888,-2.415533,-2.497631,-3.343926,-3.083625,-3.121619,-3.334308,...,2.117398,1.79429,2.146002,6.140402,12.697975,5.70234,4.799685,10.25874,5.111754,0.0
AVPEPUDEAC0005campana.wav,-1.086933,-0.654446,-0.714537,-1.020416,-1.38537,-2.245658,-2.479345,-2.370331,-3.096108,-3.192161,...,2.489801,2.686664,2.819136,1.795356,11.153049,6.725425,3.201236,10.21239,6.650511,0.0
AVPEPUDEAC0006campana.wav,-1.061914,-0.050286,-0.194129,-0.197104,-0.854592,-0.688064,-0.63422,-1.227628,-1.124753,-1.419169,...,2.230007,2.001822,2.84216,10.336635,10.171814,11.222093,2.784166,9.922195,5.694974,0.0


In [67]:
np.save('CaracteristicasExtraidas/art_w_campana_ccas',art_w_campana_ccas)

#### 2.2.3 Gato

In [68]:
art_w_gato_ccas = art_words_ccas['gato']['numpy']

In [69]:
art_w_gato_ccas.shape #100,489

(100, 489)

In [70]:
df_art_w_gato = art_words_ccas['gato']['dataframe']
df_art_w_gato.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001gato.wav,-2.354656,-1.717827,-1.415719,-1.369114,-2.270492,-3.410199,-3.358952,-3.347714,-3.220376,-3.756014,...,2.073631,2.016578,2.23497,4.800755,10.611872,12.740966,3.982177,6.834917,4.964551,0.0
AVPEPUDEAC0003gato.wav,-1.870798,-1.497047,-0.987716,-1.071903,-1.913794,-2.862332,-3.189569,-3.15804,-3.546608,-3.567739,...,3.289726,2.001362,1.551987,2.950282,3.103816,2.910483,2.501024,6.407497,3.572984,0.0
AVPEPUDEAC0004gato.wav,-2.039474,-1.127992,-0.853245,-0.759313,-1.201419,-1.698728,-2.818457,-3.655191,-3.755636,-3.839821,...,1.5,1.5,1.5,2.531261,5.843431,3.329392,2.214906,6.175351,3.39539,0.0
AVPEPUDEAC0005gato.wav,-1.739202,-1.332499,-1.164141,-1.114888,-1.471002,-2.598682,-3.187167,-4.119215,-4.235214,-4.439598,...,1.973804,1.650886,2.067884,16.521055,11.246719,14.149603,2.173454,6.002365,5.465717,0.0
AVPEPUDEAC0006gato.wav,-1.012656,-0.582052,-0.229541,0.058478,-0.185242,-0.60822,-1.474853,-2.241681,-2.243793,-2.504338,...,1.830446,2.174578,1.255449,2.404008,7.566827,4.448724,3.177582,4.851406,4.642232,0.0


In [71]:
np.save('CaracteristicasExtraidas/art_w_gato_ccas',art_w_gato_ccas)

#### 2.2.4 Petaka

In [72]:
art_w_petaka_ccas = art_words_ccas['petaka']['numpy']

In [73]:
art_w_petaka_ccas.shape #100,489

(100, 489)

In [74]:
df_art_w_petaka = art_words_ccas['petaka']['dataframe']
df_art_w_petaka.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001petaka.wav,-1.97336,-1.135221,-1.113846,-1.363613,-1.504309,-2.248044,-2.335793,-2.215129,-2.697518,-3.094406,...,2.948476,3.276315,2.498014,2.183321,5.480411,3.110792,3.235388,20.798008,10.483155,0.0
AVPEPUDEAC0003petaka.wav,-0.959998,-0.254548,0.034062,0.263732,0.429882,0.233652,-0.609069,-0.624297,-1.156101,-1.166783,...,2.617856,2.467946,2.336146,4.121163,17.760581,8.693084,2.200168,16.323693,8.62348,0.0
AVPEPUDEAC0004petaka.wav,-1.645262,-1.305023,-1.152081,-0.76684,-0.908209,-1.061711,-1.796233,-2.607557,-3.397236,-3.332771,...,2.128269,3.070303,2.449342,16.99799,11.181957,11.846039,8.91056,8.975931,9.227989,0.0
AVPEPUDEAC0005petaka.wav,-1.144195,-0.779963,-0.801965,-0.683949,-0.841663,-1.814159,-2.445552,-3.155172,-3.482943,-3.894846,...,1.893158,2.127355,4.012583,1.803343,5.922269,3.490265,2.737102,5.702261,5.955396,0.0
AVPEPUDEAC0006petaka.wav,-1.440775,-0.875594,-1.030039,-1.365907,-1.246613,-1.948272,-2.126775,-2.993997,-3.038408,-3.145563,...,1.909534,3.362664,3.35617,8.125488,12.523997,13.451336,2.099482,11.914898,6.451508,0.0


In [75]:
np.save('CaracteristicasExtraidas/art_w_petaka_ccas',art_w_petaka_ccas)

#### 2.2.5 Braso

In [76]:
art_w_braso_ccas = art_words_ccas['braso']['numpy']

In [77]:
art_w_braso_ccas.shape #100,489

(100, 489)

In [78]:
df_art_w_braso = art_words_ccas['braso']['dataframe']
df_art_w_braso.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,479,480,481,482,483,484,485,486,487,488
AVPEPUDEAC0001braso.wav,-2.307987,-1.104407,-0.707962,-1.478731,-2.793554,-3.392212,-4.023394,-4.150622,-4.205185,-4.234404,...,3.600593,2.530478,2.646788,12.287413,12.62628,10.187714,4.432015,9.967844,6.792186,0.0
AVPEPUDEAC0003braso.wav,-1.513816,-1.265803,-1.040696,-1.024026,-1.360106,-1.852351,-2.396108,-2.367203,-2.814805,-2.891699,...,2.590442,2.521784,1.717729,4.078384,11.72583,6.845383,1.877062,18.573044,11.932847,0.0
AVPEPUDEAC0004braso.wav,-2.216458,-1.635965,-2.037848,-1.777778,-1.992181,-2.302595,-2.694359,-2.862033,-3.143958,-3.69214,...,2.003291,2.099486,2.012798,14.427509,11.323894,11.466559,2.635242,5.816457,3.239385,0.0
AVPEPUDEAC0005braso.wav,-2.29653,-1.358635,-1.794694,-1.998699,-2.150563,-2.075398,-2.514231,-3.295901,-3.483721,-3.589338,...,2.124112,1.743509,2.160303,13.029999,15.088421,7.607444,4.335193,4.639008,3.477537,0.0
AVPEPUDEAC0006braso.wav,-1.658551,-0.552369,-0.8366,-1.345115,-1.644357,-2.226558,-2.677129,-3.423217,-3.027067,-3.043681,...,1.5,1.5,1.5,8.148219,6.450195,5.980811,3.552013,5.545563,3.998205,0.0


In [79]:
np.save('CaracteristicasExtraidas/art_w_braso_ccas',art_w_braso_ccas)

-------------------
# 3. Extracting prosody measures from complete audios<a id="proso"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

***38 static features are extracted in total. In summary:***
>* 1-4, 14-16  : Fundamental frequency measures.
>* 5-7, 33-38  : Energy measures.
>* 9-13, 17-32  : Voiced/unvoiced and pause measures.

**For each audio we return a vector with these 38 features plus the class label.**
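
Because the class label is stored as the last column, any later step has to separate it from the features. Here is a minimal sketch on a synthetic matrix with the same 39-column layout. The label encoding used here (0.0 for the first half of the rows, 1.0 for the second half) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_audios, n_features = 100, 38
features = rng.normal(size=(n_audios, n_features))
labels = np.repeat([0.0, 1.0], n_audios // 2)   # assumed encoding of the two classes
data = np.column_stack([features, labels])      # same layout as a saved (100, 39) matrix

# Everything except the last column is a feature; the last column is the class
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)   # (100, 38) (100,)
```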

### 3.1 Extracting prosody measures for the sentence<a id="prosofra"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>

In [80]:
def prosody_readtext_extraction():
    return extraccion_ccas_directorio('prosody', 'PC-GITA/read-text/', 'prs_rt_hc.txt' , 'prs_rt_pd.txt' )

In [81]:
prs_rt_ccas = prosody_readtext_extraction()

In [82]:
assert prs_rt_ccas.shape == (100,39)
prs_rt_ccas.shape

(100, 39)

In [83]:
# Create a DataFrame for visualization
df_prs_rt = pd.DataFrame(prs_rt_ccas,index=name_rt)
df_prs_rt.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29,30,31,32,33,34,35,36,37,38
AVPEPUDEAC0001_readtext.wav,224.204269,55.103287,97.441279,346.709198,-32.613278,16.302759,-6.609341,1.659963,306.206897,226.889215,...,0.02,0.01,1.736842,0.222095,0.157068,2045.904843,0.00762,7.514159,5.647285,0.0
AVPEPUDEAC0003_readtext.wav,115.294716,18.56694,56.775624,169.191757,-38.071267,17.142829,-7.593128,1.412424,285.416667,162.043697,...,0.01,0.01,1.777778,-0.166861,-0.087782,1013.7909,0.023195,6.267308,4.445037,0.0
AVPEPUDEAC0004_readtext.wav,199.577942,37.672562,83.274297,294.983459,-32.739049,16.343679,-9.198493,1.873954,290.0,218.002294,...,0.04,0.01,1.56,-0.272972,-0.111719,790.795884,-0.016517,5.294076,4.667432,0.0
AVPEPUDEAC0005_readtext.wav,158.251373,32.527523,78.189821,239.559006,-29.890175,18.702501,-3.955363,1.811068,329.230769,307.307452,...,0.05,0.02,1.555556,-0.439032,-0.288996,311.330724,-0.029652,6.228063,5.094337,0.0
AVPEPUDEAC0006_readtext.wav,229.567856,47.42207,92.243365,343.436157,-34.480475,16.401607,-5.960529,1.404541,394.0,379.747284,...,0.04,0.02,1.733333,-0.07554,-0.055301,938.962934,0.00419,4.70548,4.163584,0.0


In [84]:
np.save('CaracteristicasExtraidas/prs_rt_ccas',prs_rt_ccas)

-------------

# 4. Data cleaning <a id="limp"></a><a href="#index"><i class="fa fa-list-alt" aria-hidden="true"></i></a>
Data cleaning is another important stage of feature extraction. In this step we try to resolve any problems with noisy or missing data. In our case, we handle the NaN values produced by the feature extraction scripts.
### Cleaning approach
1. First we must **identify in which datasets** these NaN values occur.
2. Then we must **identify in which audios and attributes** they occur.
3. From there we have two options:
    > 1. Remove the defective instances.
    > 2. Replace the missing value with another one (attribute mean, attribute mean for its class...).
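
The notebook goes on to use the first option. For comparison, the second option (replacing each missing value with the mean of that attribute within the same class) could be sketched as follows. `imputar_media_clase` is a hypothetical helper, and the last column is assumed to hold the class label.

```python
import numpy as np

def imputar_media_clase(data):
    """Return a copy where each NaN is replaced by the per-class attribute mean."""
    limpio = data.copy()
    etiquetas = limpio[:, -1]
    for clase in np.unique(etiquetas):
        filas = etiquetas == clase
        bloque = limpio[filas, :-1]                 # copy of the rows of this class
        medias = np.nanmean(bloque, axis=0)         # column means ignoring NaN
        huecos = np.isnan(bloque)
        bloque[huecos] = np.take(medias, np.where(huecos)[1])
        limpio[filas, :-1] = bloque
    return limpio

demo = np.array([[1.0, 2.0, 0.0],
                 [3.0, np.nan, 0.0],
                 [10.0, 20.0, 1.0],
                 [np.nan, 40.0, 1.0]])
print(imputar_media_clase(demo))
```

Imputation keeps every audio in the dataset, at the cost of introducing values that were never measured. Deletion avoids that but shrinks the dataset.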

#### Identifying the datasets and audios in which NaN values occur
We return a dictionary with:
1. **Key = numpy file that contains NaN.**
2. **Value = list of 4 elements:**
> 1. Set of ids of the audios with NaN.
> 2. Number of audios with NaN.  
> 3. Fraction of the audios with NaN that belong to PD.
> 4. List of lists identifying each individual NaN = [[idAud, idAtr],[idAud, idAtr]]

`dic['file.npy'] = [ {8,56}, 2, 0.5, [[8,2],[8,17],[56,3]] ]`
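
To make the structure of each entry concrete, the sketch below builds the same four elements for a small synthetic matrix. The threshold that separates the two groups (`primer_pd`) is an assumption of this toy example.

```python
import numpy as np

data = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, np.nan],
                 [0.7, 0.8, 0.9],
                 [np.nan, np.nan, 1.2]])
primer_pd = 2   # toy assumption: rows 0-1 are HC, rows 2-3 are PD

pares = np.argwhere(np.isnan(data))            # one [idAud, idAtr] row per NaN
audios = set(int(i) for i in pares[:, 0])      # audios (rows) with at least one NaN
frac_pd = sum(i >= primer_pd for i in audios) / len(audios)
entrada = [audios, len(audios), frac_pd, pares.tolist()]
print(entrada)   # [{1, 3}, 2, 0.5, [[1, 2], [3, 0], [3, 1]]]
```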

In [85]:
def identificadorNan(verbose=False):
    '''Find the saved feature sets that contain NaN and locate each NaN.'''
    sets_ccas = [d for d in os.listdir('CaracteristicasExtraidas') if d.endswith('.npy')]
    ruta_ccas = './CaracteristicasExtraidas/'
    dic_nans = dict()
    for ccas in sets_ccas:
        data = np.load(ruta_ccas+ccas)
        if np.isnan(data).any():
            # (row, column) index of every NaN and the set of affected audios (rows)
            nan = np.argwhere(np.isnan(data))
            audiosconNaN = set(nan[:,0])
            # Rows with index > 49 are counted as PD (assumes 50 HC rows followed by 50 PD rows)
            frac_pd = len(np.where( np.array(list(audiosconNaN)) > 49 )[0])/len(audiosconNaN)
            dic_nans[ccas] =  [audiosconNaN, len(audiosconNaN), frac_pd, nan]
            if verbose:
                print('\n--------------\n',ccas)
                print('\t(Audios, attributes): ',data.shape)
                print('\tAudios with NaN: ',audiosconNaN)
                print('\tNumber of audios with NaN: ',len(audiosconNaN))
                print('\tFraction of NaN audios in PD: ',dic_nans[ccas][2])
            
    return dic_nans

In [86]:
dc=identificadorNan(True)


--------------
 fon_w_atleta_ccas.npy
	(Audios, attributes):  (100, 30)
	Audios with NaN:  {57}
	Number of audios with NaN:  1
	Fraction of NaN audios in PD:  1.0

--------------
 fon_w_braso_ccas.npy
	(Audios, attributes):  (100, 30)
	Audios with NaN:  {28, 69, 15}
	Number of audios with NaN:  3
	Fraction of NaN audios in PD:  0.3333333333333333

--------------
 fon_w_gato_ccas.npy
	(Audios, attributes):  (100, 30)
	Audios with NaN:  {15, 18, 19, 27, 28, 34, 36, 37, 42, 45, 57, 60, 61, 67, 69, 70, 73, 80, 81, 83, 88, 91, 96, 98}
	Number of audios with NaN:  24
	Fraction of NaN audios in PD:  0.5833333333333334

--------------
 fon_w_petaka_ccas.npy
	(Audios, attributes):  (100, 30)
	Audios with NaN:  {57, 45, 77, 13}
	Number of audios with NaN:  4
	Fraction of NaN audios in PD:  0.5


In [87]:
# Example of the idAudio-idAtrib pairs where there is a NaN
dc['fon_w_braso_ccas.npy'][3]

array([[15,  5],
       [15, 12],
       [15, 19],
       [15, 26],
       [28,  5],
       [28, 12],
       [28, 19],
       [28, 26],
       [69,  5],
       [69, 12],
       [69, 19],
       [69, 26]], dtype=int64)
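
The pairs above already suggest that the NaN values fall on the same few attributes in every affected audio. A short sketch that counts the NaN values per attribute from such an array of pairs (the array is copied from the output above):

```python
import numpy as np

pares = np.array([[15, 5], [15, 12], [15, 19], [15, 26],
                  [28, 5], [28, 12], [28, 19], [28, 26],
                  [69, 5], [69, 12], [69, 19], [69, 26]])

# Second column is the attribute index; count how many audios miss each one
atributos, cuenta = np.unique(pares[:, 1], return_counts=True)
for atr, n in zip(atributos, cuenta):
    print(f'attribute {atr}: NaN in {n} audios')
```

For this subset only four attribute columns (5, 12, 19 and 26) are affected, each in the same three audios.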

### Handling the NaN values
We chose to remove the audios that contain NaN. Audios with NaN were found in 4 of the 18 feature subsets, and in most of them fewer than 5 audios are affected. We conclude that this is a small subset of instances that can be removed entirely. 
There is a single case, the phonation features of the word gato, with more affected audios: 24 audios contain NaN. We follow the same strategy there and remove the instances with missing values.

In [88]:
def tratamiento_nan(dic_nan):
    '''Delete the instances with NaN from each feature set and overwrite the saved file.'''
    ruta_ccas = './CaracteristicasExtraidas/'
    for arch in dic_nan:
        data = np.load(ruta_ccas+arch)
        # dic_nan[arch][0] is the set of row indices (audios) that contain NaN
        data = np.delete(data, list(dic_nan[arch][0]), 0)
        np.save(ruta_ccas+arch, data)


In [89]:
tratamiento_nan(dc)

In [90]:
def ver(dic_nan):
    '''Print the new dimensions of the cleaned feature sets.'''
    ruta_ccas = './CaracteristicasExtraidas/'
    for arch in dic_nan:
        data = np.load(ruta_ccas+arch)
        print(arch, data.shape)

In [91]:
ver(dc)

fon_w_atleta_ccas.npy (99, 30)
fon_w_braso_ccas.npy (97, 30)
fon_w_gato_ccas.npy (76, 30)
fon_w_petaka_ccas.npy (96, 30)
