# Text-based Model: Data Processing
-------

The scraped textual data is transcribed in the **CHAT** format, which uses special notations to annotate the speech. These annotations carry additional information that cannot be inferred from the spoken text alone, such as:

> * **Babbling**: @b, @u, @wp, [babble], &=babble
> * **Repetition**: [=repeat ...], [x n] where n is a number, [/] 
> * **Best Guess** (unclear words): [?], [=? text]
> * **Unintelligible** (incomprehensible): [=jargon], xxx, xxxx, yyy
> * **Incompletion**: +..., +..?
> * **Onomatopoeia** (animal sounds and attempts to imitate natural sounds): @o
> * **Hesitation**: [//], [///], &+, [/?]
> * **Misspelling**: [: text]
> * **Disfluency**: (.), (..), (...), [/-], &Text:Text, &-Text, &Text
> * **Events**: &=text  

Events are generally the transcriber's visual notes on actions the child performs, such as sneezing, jumping or imitating animals. In the current version of the project, we are only interested in the text (what the child said, not what the child did or how the child acted). However, such information could be helpful for distinguishing children with ASD.

* **Step 1: Annotate Speech--** The first processing task is to group all these annotations by category, as done above, and replace each group with a single key word. For example, whenever one of the annotations used to mark **babbling** is found in the speech, it is replaced by the key word **BAB**. The other groups are processed in the same way. In addition, we made sure to use key words that do not belong to the vocabulary of the corpus, to avoid biasing model training afterwards.

* **Step 2: Common Processing--** After annotating the speech, we applied the common processing tasks, such as removing foreign-language speech, Mojibake, numbers, punctuation, and extra spaces, tabs and line breaks.

* **Step 3: Lemmatization--** In this third step, we applied lemmatization, which reduces each word to its base (dictionary) form.

* **Step 4: Extract Meaningful Speech--** In this step, we used the English dictionary provided with the **PyEnchant** library to keep only meaningful English words (e.g., bibobi is not a meaningful word).

* **Step 5: Structure Meaningful Speech--** After extracting the meaningful words, we kept only certain word categories, such as subjects, nouns, verbs, adjectives and adverbs.

* **Step 6: Stem Lemmatized Speech--** In this step, we stemmed the already lemmatized speech to reduce the words further to their stems.
 
* **Step 7: Stem Structured Speech--** In another attempt, we also stemmed the structured speech, which contains only the selected categories of meaningful words. This last step yields the smallest vocabulary size.
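
Steps 1 and 2 can be pictured with a minimal sketch, assuming regular expressions are used for the replacement. The patterns below cover only a handful of the notations listed earlier, and the names `ANNOTATION_PATTERNS`, `annotate` and `clean` are hypothetical; they are not taken from `NLP_Functions`. Dropping events and leftover brackets is likewise an assumption of this sketch.

```python
import re

# Hypothetical subset of CHAT notations mapped to key words (not the project's actual table).
ANNOTATION_PATTERNS = [
    # repetition "[x n]": emit the key word n times
    (re.compile(r"\[x (\d+)\]"), lambda m: " " + " ".join(["REP"] * int(m.group(1))) + " "),
    (re.compile(r"\[: [^\]]*\]"), " MIS "),                       # misspelling "[: text]"
    (re.compile(r"@(?:b|u|wp)\b|\[babble\]|&=babble"), " BAB "),  # babbling
    (re.compile(r"&=\S+"), " "),                                  # other events are dropped (assumption)
    (re.compile(r"&-\w+"), " DISF "),                             # disfluency "&-Text"
    (re.compile(r"\[[^\]]*\]"), " "),                             # any bracket left over (assumption)
]


def annotate(speech: str) -> str:
    """Step 1: replace each recognised notation with its key word."""
    for pattern, replacement in ANNOTATION_PATTERNS:
        speech = pattern.sub(replacement, speech)
    return speech


def clean(speech: str) -> str:
    """Step 2: drop non-ASCII debris, digits and punctuation, then squeeze whitespace."""
    # Removing non-ASCII characters first avoids splitting a word around Mojibake debris.
    speech = speech.encode("ascii", errors="ignore").decode("ascii")
    speech = re.sub(r"[^A-Za-z\s]", " ", speech)
    return re.sub(r"\s+", " ", speech).strip().lower()


print(clean(annotate("\tnuhn [: nothing] .")))  # nuhn mis
```

Applying the patterns in a fixed order keeps the more specific notations (such as `&=babble`) from being swallowed by the more general ones (such as `&=text`).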

> Note that the output of every processing task is saved as a new column in the CSV file. At the end of the processing, we get 6 new columns, all derived from the raw speech passed as a parameter to the processing function. The new columns are:

* **clean_annotated_speech**: output of steps 1 and 2.
* **lemmatized_speech**: output of step 3.
* **meaningful_speech**: output of step 4.
* **structured_speech**: output of step 5.
* **stemmed_lemmatized_speech**: output of step 6.
* **stemmed_structured_speech**: output of step 7.
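
As an illustration of how the **meaningful_speech** column could be derived in step 4, here is a minimal sketch of a dictionary filter. The function `keep_meaningful`, the toy vocabulary and the key-word whitelist are assumptions made so the example runs without PyEnchant; with the library installed, `enchant.Dict("en_US").check` could be passed as the checker instead. Whether the project whitelists its key words this way is not stated in the text.

```python
from typing import Callable


def keep_meaningful(speech: str, is_word: Callable[[str], bool]) -> str:
    """Keep only the tokens that the supplied checker accepts as words."""
    return " ".join(token for token in speech.split() if is_word(token))


# Stand-in vocabulary (hypothetical), used here in place of a real dictionary.
TOY_VOCABULARY = {"the", "go", "to", "kitchen"}
# Annotation key words must survive the filter (assumed whitelist).
KEY_WORDS = {"bab", "mis", "disf", "rep"}


def toy_is_word(token: str) -> bool:
    return token in TOY_VOCABULARY or token in KEY_WORDS


print(keep_meaningful("the bibobi go to kitchen bab", toy_is_word))  # the go to kitchen bab
```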

> To apply these processing tasks, we defined a main function, **preprocess**, that takes as parameters the dataset, the column on which the processing will be performed, and the path of the CSV file where the outputs will be saved. This main function relies on other elementary functions to perform the different tasks: searching, tokenizing, cleaning, lemmatizing, stemming, structuring and so on.
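
The overall shape of such a function can be sketched as follows. This is not the code of `NLP_Functions.preprocess`: the name `preprocess_sketch` is hypothetical, plain dictionaries and the standard `csv` module are used instead of pandas, and the two step functions are simple placeholders (whitespace normalisation and crude truncation) standing in for the real cleaning and stemming steps. The point is only that each step reads an earlier column and writes its output to a new one before everything is saved.

```python
import csv
import os
import tempfile
from typing import Dict, List


def normalize(speech: str) -> str:
    """Placeholder for the cleaning step: lowercase and squeeze whitespace."""
    return " ".join(speech.lower().split())


def truncate(speech: str) -> str:
    """Placeholder for a stemming step: crude truncation, not a real stemmer."""
    return " ".join(word[:5] for word in speech.split())


def preprocess_sketch(rows: List[Dict[str, str]], column: str, output_path: str) -> List[Dict[str, str]]:
    """Run each step over every row, store its output in a new column, then save to CSV."""
    # (source column, new column, function); the new column names are illustrative.
    steps = [
        (column, "clean_annotated_speech", normalize),
        ("clean_annotated_speech", "truncated_speech", truncate),
    ]
    for source, target, func in steps:
        for row in rows:
            row[target] = func(row[source])
    fieldnames = list(rows[0].keys()) if rows else []
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


output_path = os.path.join(tempfile.mkdtemp(), "sketch.csv")
result = preprocess_sketch([{"speech": "  I be going to the Kitchen "}], "speech", output_path)
print(result[0])
```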

-----------


In [1]:
#Generic libs
import pandas as pd

# predefined modules
from modules import NLP_Functions as NLP_F

#global params
autism_path = 'data/autism_sample.csv'
preprocessed_dataset_path = 'data/preprocessed_autism.csv'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\softeam2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load data

In [None]:
autism_dataset = pd.read_csv(autism_path)
autism_dataset.head()

## Preprocessing

In [None]:
column = 'speech'
# Pass the loaded DataFrame, as the description of preprocess expects a dataset
NLP_F.preprocess(autism_dataset, column, preprocessed_dataset_path)

## Preprocessing test

In [3]:
text1 = '\thaeboÉ¤boÉ¤@u m haeboÉ¤boÉ¤@u o and p !'
text2 = '\t&=nods:yes .'
text3 = '\tnuhn [: nothing] .'
text4 = '\tridin(g) under de [: the] batrack@u [= track] .'
text5 = '\t&-uh baba@u'
text6 = 'I be going to go out to the kitchen dis alright '
text7 = '\tin the , Monsters Inc tower , like show you , like .'
text8 = 'uni'
text9 = 'Hello [Repeat 3 times] yes [x 5]'
corpus = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
test = pd.DataFrame(corpus, columns=['speech'])
test

| Unnamed: 0 | speech |
|---|---|
| 0 | \thaeboÉ¤boÉ¤@u m haeboÉ¤boÉ¤@u o and p ! |
| 1 | \t&=nods:yes . |
| 2 | \tnuhn [: nothing] . |
| 3 | \tridin(g) under de [: the] batrack@u [= track] . |
| 4 | \t&-uh baba@u |
| 5 | I be going to go out to the kitchen dis alrigh... |
| 6 | \tin the , Monsters Inc tower , like show you ... |
| 7 | uni |
| 8 | Hello [Repeat 3 times] yes [x 5] |


In [4]:
preprocessed_dataset_path = 'data/test_preprocess.csv'
column ='speech'
NLP_F.preprocess(test, column, preprocessed_dataset_path) 

Step 1: Annotate Speech


100%|██████████| 9/9 [00:00<00:00, 2912.49it/s]


Step 2: Common Preprocessing


100%|██████████| 9/9 [00:00<00:00, 8998.51it/s]


Step 3: Lemmatization


100%|██████████| 9/9 [00:00<00:00, 215.47it/s]


Step 4: Extract Meaningful Speech


100%|██████████| 9/9 [00:00<00:00, 333.97it/s]


Step 5: Structure Speech


100%|██████████| 9/9 [00:00<00:00, 296.13it/s]


Step 6: Stem Lemmatized Speech


100%|██████████| 9/9 [00:00<00:00, 4510.54it/s]


Step 7: Stem Structured Speech


100%|██████████| 9/9 [00:00<00:00, 9000.65it/s]

Ouf!! Save ...
Preprocessing is done, you find your clean data at data/test_preprocess.csv





In [5]:
df = pd.read_csv(preprocessed_dataset_path)
df

| Unnamed: 0 | speech | clean_annotated_speech | lemmatized_speech | meaningful_speech | structured_speech | stemmed_lemmatized_speech | stemmed_structured_speech |
|---|---|---|---|---|---|---|---|
| 0 | \thaeboÉ¤boÉ¤@u m haeboÉ¤boÉ¤@u o and p ! | haebobo bab m haebobo bab o and p | haebobo bab m haebobo bab o and p | bab m bab o and p | bab bab o p | haebobo bab m haebobo bab o and p | bab bab o p |
| 1 | \t&=nods:yes . |  |  |  |  |  |  |
| 2 | \tnuhn [: nothing] . | nuhn mis | nuhn mis | mis | mis | nuhn mis | mis |
| 3 | \tridin(g) under de [: the] batrack@u [= track] . | ridin disf under de mis batrack bab | ridin disf under de mis batrack bab | disf under de mis bab | disf mis bab | ridin disf und de mis batrack bab | disf mis bab |
| 4 | \t&-uh baba@u | disf baba bab | disf baba bab | disf baba bab | disf bab | disf bab bab | disf bab |
| 5 | I be going to go out to the kitchen dis alrigh... | i be going to go out to the kitchen dis alright | I be go to go out to the kitchen dis alright | I be go to go out to the kitchen dis alright | I be go go kitchen alright | i be go to go out to the kitch dis alright | i be go go kitch alright |
| 6 | \tin the , Monsters Inc tower , like show you ... | in the monsters inc tower like show you like | in the monsters inc tower like show you like | in the monsters inc tower like show you like | show you like | in the monst int tow lik show you lik | show you lik |
| 7 | uni | uni | uni | uni | uni | uni | uni |
| 8 | Hello [Repeat 3 times] yes [x 5] | hello yes rep rep rep rep rep | hello yes rep rep rep rep rep | hello yes rep rep rep rep rep | hello yes rep rep rep rep rep | hello ye rep rep rep rep rep | hello ye rep rep rep rep rep |
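
To compare the vocabulary size that each processing variant produces, one can count the distinct tokens per column. The sketch below is only an illustration: the helper `vocabulary_size` is hypothetical, and the values are copied from rows 5 and 6 of the output above rather than read from the CSV file. On such a tiny sample stemming does not merge any further word forms, so the structured and stemmed structured columns can tie; a difference should only be expected on a larger corpus.

```python
from typing import Dict, List

# Values copied from rows 5 and 6 of the preprocessing output.
columns: Dict[str, List[str]] = {
    "lemmatized_speech": [
        "I be go to go out to the kitchen dis alright",
        "in the monsters inc tower like show you like",
    ],
    "structured_speech": [
        "I be go go kitchen alright",
        "show you like",
    ],
    "stemmed_lemmatized_speech": [
        "i be go to go out to the kitch dis alright",
        "in the monst int tow lik show you lik",
    ],
    "stemmed_structured_speech": [
        "i be go go kitch alright",
        "show you lik",
    ],
}


def vocabulary_size(utterances: List[str]) -> int:
    """Count the distinct whitespace-separated tokens across all utterances."""
    return len({token for utterance in utterances for token in utterance.split()})


sizes = {name: vocabulary_size(values) for name, values in columns.items()}
for name, size in sizes.items():
    print(f"{name}: {size}")
```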
