# Text-based Model: Feature Engineering

-------   

In this part of the project, we dig deeper into the collected and preprocessed data to derive additional features for the training phase.

-----------

In [1]:
# Generic libs
import pandas as pd
import csv

# Predefined modules
from modules import NLP_Functions as NLP_F

# Global params
dataset_path = 'data/preprocessed_autism.csv'
new_dataset_path = 'data/autism_with_metadata.csv'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\softeam2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load Data

In [2]:
data = pd.read_csv(dataset_path)
data.head()

Unnamed: 0,name,age,sex,speech,ASD,abs_age,clean_annotated_speech,lemmatized_speech,meaningful_speech,structured_speech,stemmed_lemmatized_speech,stemmed_structured_speech
0,Eigsti,5;03.10,female,\tokay .,1,5.0,okay,okay,okay,okay,okay,okay
1,Eigsti,5;03.10,female,\tdid you see this ?,1,5.0,did you see this,do you see this,do you see this,do you see this,do you see thi,do you see thi
2,Eigsti,5;03.10,female,\tyeah .,1,5.0,yeah,yeah,yeah,yeah,yeah,yeah
3,Eigsti,5;03.10,female,\txxx let's see +...,1,5.0,uni let's see inq,uni let us see inq,uni let us see inq,uni let we see inq,uni let us see inq,uni let we see inq
4,Eigsti,5;03.10,female,\txxx .,1,5.0,uni,uni,uni,uni,uni,uni


## Extract new features

> We want to extract **metadata** as **quantities** from the different forms of speech created in the previous phase. This metadata reflects each child's level of linguistic skill.
> * **Length of clean annotated speech**: how expressive is the child?
> * **Length of meaningful speech**: how developed is the child's vocabulary?
> * **Length of structured speech**: how structured is the child's speech?

Note that by length we mean the number of words, so short and long words count the same. Annotation keywords are excluded from the count.
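As a rough illustration of this length definition, here is a minimal sketch. It is not the code in `NLP_Functions`: the function name `speech_length` is hypothetical, the keyword set is inferred from the `n_*` column names shown later in this notebook, and whitespace splitting is an assumed tokenization.

```python
# Annotation keywords inferred from the n_* column names in this notebook;
# the actual list used by NLP_Functions may differ.
ANNOTATION_KEYWORDS = {"bab", "gue", "uni", "rep", "inq", "ono", "hes", "mis", "disf"}


def speech_length(text):
    """Count whitespace-separated words, skipping annotation keywords."""
    if not isinstance(text, str):
        return 0  # missing speech (e.g. NaN) counts as zero words
    return sum(1 for token in text.split() if token not in ANNOTATION_KEYWORDS)


print(speech_length("did you see this"))   # 4
print(speech_length("uni let's see inq"))  # 2
print(speech_length("uni"))                # 0
```

One limitation of this simple approach: a genuine word that happens to equal a keyword would also be excluded from the count.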

> * **Number of occurrences of each annotation** (number of repetitions, number of babbling instances, and so on): this is important information because these annotations mark common signs of linguistic disorders.
> * **Number of distinct words used**: how wide-ranging is the child's vocabulary?
> * **Speech density**: the number of characters used in the speech, excluding whitespace and annotations. This indicates whether the child tends to use short or long words.
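The three quantities above can be sketched for a single utterance as follows. This is an assumption-laden illustration, not the implementation behind `NLP_F.extract_features`: `describe_utterance` is a hypothetical name, the keyword tuple is inferred from the `n_*` columns, and words are split on whitespace.

```python
from collections import Counter

# Keywords inferred from the n_* columns; an assumption, not the module's list.
ANNOTATION_KEYWORDS = ("bab", "gue", "uni", "rep", "inq", "ono", "hes", "mis", "disf")


def describe_utterance(text):
    """Return annotation counts, distinct-word count, and density for one utterance."""
    tokens = text.split()
    counts = Counter(token for token in tokens if token in ANNOTATION_KEYWORDS)
    words = [token for token in tokens if token not in ANNOTATION_KEYWORDS]

    features = {f"n_{key}": counts.get(key, 0) for key in ANNOTATION_KEYWORDS}
    features["n_diff_words"] = len(set(words))               # distinct words only
    features["density"] = sum(len(word) for word in words)   # characters, no spaces
    return features


row = describe_utterance("uni let's see inq")
print(row["n_uni"], row["n_inq"], row["n_diff_words"], row["density"])  # 1 1 2 8
```

The density values from this sketch agree with the rows shown in this notebook's output (for example 13 for "did you see this"). The distinct-word count does not always agree: the notebook reports 3 for "uni let's see inq", which suggests the real function tokenizes differently or reads another speech column.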

In [3]:
%%time
NLP_F.extract_features(data, new_dataset_path)

Step 1: Compute speech length 1/3
Step 1: Compute speech length 2/3
Step 1: Compute speech length 3/3
Step 2: Compute number of each annotation
Step 3: convert age to months
Step 4: compute number of different used words
Step 5: compute density of speech
Save!! ...
Features Extracting is done, you find your extracted data at data/autism_with_metadata.csv
Wall time: 6min 38s
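Step 3 in the log converts the age to months. A minimal sketch of such a conversion is shown below, assuming the `age` strings follow a `years;months.days` pattern (as in `5;03.10`) and that the days part is ignored. The function name `age_to_months` is hypothetical and the module's actual logic may differ.

```python
def age_to_months(age):
    """Convert a 'years;months.days' string to months, ignoring the days part."""
    years, _, rest = age.partition(";")
    months = rest.split(".")[0] if rest else ""
    return float(int(years) * 12 + (int(months) if months else 0))


print(age_to_months("5;03.10"))  # 63.0
```

The result for `5;03.10` matches the `age_in_months` value of 63.0 in the extracted dataset.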


In [4]:
data_with_metadata = pd.read_csv(new_dataset_path)
data_with_metadata.head()

Unnamed: 0,name,age,sex,speech,ASD,abs_age,clean_annotated_speech,lemmatized_speech,meaningful_speech,structured_speech,...,n_uni,n_rep,n_inq,n_ono,n_hes,n_mis,n_disf,age_in_months,n_diff_words,density
0,Eigsti,5;03.10,0,\tokay .,1,5.0,okay,okay,okay,okay,...,0,0,0,0,0,0,0,63.0,1,4
1,Eigsti,5;03.10,0,\tdid you see this ?,1,5.0,did you see this,do you see this,do you see this,do you see this,...,0,0,0,0,0,0,0,63.0,4,13
2,Eigsti,5;03.10,0,\tyeah .,1,5.0,yeah,yeah,yeah,yeah,...,0,0,0,0,0,0,0,63.0,1,4
3,Eigsti,5;03.10,0,\txxx let's see +...,1,5.0,uni let's see inq,uni let us see inq,uni let us see inq,uni let we see inq,...,1,0,1,0,0,0,0,63.0,3,8
4,Eigsti,5;03.10,0,\txxx .,1,5.0,uni,uni,uni,uni,...,1,0,0,0,0,0,0,63.0,0,0


In [5]:
data_with_metadata.columns

Index(['name', 'age', 'sex', 'speech', 'ASD', 'abs_age',
       'clean_annotated_speech', 'lemmatized_speech', 'meaningful_speech',
       'structured_speech', 'stemmed_lemmatized_speech',
       'stemmed_structured_speech', 'len_clean_annotated_speech',
       'len_meaningful_speech', 'len_structured_speech', 'n_bab', 'n_gue',
       'n_uni', 'n_rep', 'n_inq', 'n_ono', 'n_hes', 'n_mis', 'n_disf',
       'age_in_months', 'n_diff_words', 'density'],
      dtype='object')