# Text-based Model: Data Collection
-------
> To obtain the **textual** data containing the children's speech, we used [TalkBank](https://talkbank.org/), a multilingual **corpus** established in 2002 to foster fundamental research on human and animal communication. We scraped the speech of children **with** and **without Autism Spectrum Disorder (ASD)**.

> We ensured that the children without ASD **do not present any other linguistic disorder**.

> The textual data on TalkBank is organised by **topic** (CHILDES, ASDBank, PhonBank, AphasiaBank, ...). Under each topic, the data is further classified by language (English, French, Chinese, ...). Within each repository, all speech data is transcribed in the **CHAT** format and stored in files with the extension **.cha**. The header of each cha file describes the recording context. We parsed that description to find out whether the child has ASD or any other disorder.
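
> As an illustration of header parsing, here is a minimal sketch. It assumes the diagnosis is recorded in the *group* field of the `@ID` header line, which may differ between corpora. The function name `parse_chat_header` and the example transcript are hypothetical and are not part of `Scrape_Functions`.

```python
def parse_chat_header(cha_text, speaker="CHI"):
    """Return age, sex and group for one speaker from the @ID header line.

    Hypothetical helper: assumes the usual CHAT layout
    @ID: language|corpus|code|age|sex|group|SES|role|education|custom|
    """
    for line in cha_text.splitlines():
        if not line.startswith("@ID:"):
            continue
        fields = line[len("@ID:"):].strip().split("|")
        if len(fields) >= 6 and fields[2] == speaker:
            return {"age": fields[3], "sex": fields[4], "group": fields[5]}
    return None  # no @ID line for this speaker


# Made-up transcript used only to exercise the function
example = (
    "@UTF8\n@Begin\n@Languages:\teng\n"
    "@ID:\teng|demo|CHI|4;02.10|male|ASD||Target_Child|||\n"
    "@ID:\teng|demo|MOT|||||Mother|||\n"
    "*CHI:\tmore juice .\n@End\n"
)
info = parse_chat_header(example)
print(info)
```

> In practice the group field is often empty, so a real parser would probably also need to inspect other header lines or the corpus the file comes from.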

> The scraping of the children's speech proceeds as follows:
>> **Step 1)** **Scrape** the **URL**s of the different **repositories** where cha files are saved, using the criterion **ASD or NO-ASD**.   
**Step 2)** **Scrape** the **URL**s of the **cha files** that contain the children's speech.      
**Step 3)** **Scrape and Parse** the **cha files content** using a **yaml** file that specifies the elements to be scraped. After parsing the content of the cha file, only the relevant elements are saved into a **csv** file. The **name, age, sex** and the **target** value are scraped from the header of the cha file. The **target** value is set to **0** for children without ASD and **1** for those with ASD. The **speech** is scraped from the body of the cha file.

> The content scraping, parsing and saving are executed simultaneously to decrease the number of variables (storage units) used in the process.
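
> The idea of parsing and saving in one pass can be sketched with a generator that yields utterances and a writer that stores each row as soon as it is produced. This is only an illustration under assumptions: the function names, the two-column layout and the in-memory buffer are hypothetical, and the actual `scrape_and_parse_cha_files` may work differently.

```python
import csv
import io


def iter_child_utterances(cha_text, speaker="CHI"):
    """Yield the speaker's main-tier lines one at a time.

    Simplification: tab-indented continuation lines and dependent
    tiers (lines starting with %) are ignored.
    """
    prefix = f"*{speaker}:"
    for line in cha_text.splitlines():
        if line.startswith(prefix):
            yield line[len(prefix):].strip()


def write_rows(cha_texts, out, asd):
    """Parse each transcript and write its rows immediately, without
    collecting all utterances in a list first."""
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["speech", "ASD"])
    n = 0
    for text in cha_texts:
        for utterance in iter_child_utterances(text):
            writer.writerow([utterance, asd])
            n += 1
    return n


# Made-up transcript; io.StringIO stands in for an open CSV file
transcripts = ["@Begin\n*CHI:\tmore juice .\n*MOT:\tyou want juice ?\n*CHI:\tyes .\n@End\n"]
buffer = io.StringIO()
rows_written = write_rows(transcripts, buffer, asd=1)
csv_text = buffer.getvalue()
print(rows_written)
```

> Because rows are written as they are parsed, memory use stays roughly constant regardless of how many cha files are processed.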

-------

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
We cannot share the data for privacy reasons but you can refer to the following documentation to have more information about the cha files: [TalkBank Documentation](https://talkbank.org/manuals/CHAT.pdf).
</div> </pre> 


In [None]:
import pandas as pd 
import csv

# Predefined Modules
from modules import Scrape_Functions as Scrape_F

# global params
yml_file = 'data/url.yml'
autism_path = 'data/autism.csv'
sample_age_path = 'data/autism_sample_age.csv'
sample_path = 'data/autism_sample.csv'

seed = 0

# Children with ASD

In [None]:
asd_urls_path = 'data/urls_asd.txt'
asd_rep_urls_path = 'data/asd_rep_urls'
asd_cha_files_path= 'data/asd_cha_files_urls'

asd_csv_file_path = 'data/asd_speech.csv'

## Scrape Repositories URLs

In [None]:
%%time
asd_rep_urls = Scrape_F.scrape_repositories_urls(yml_file, asd_urls_path)
Scrape_F.list_to_txt(asd_rep_urls_path, asd_rep_urls)
print(f'we have scraped {len(asd_rep_urls)} repositories urls')

## Scraping cha-files urls

In [None]:
%%time
asd_cha_files_urls = Scrape_F.scrape_urls_files(yml_file, asd_rep_urls)
Scrape_F.list_to_txt(asd_cha_files_path, asd_cha_files_urls)
print(f'we have scraped {len(asd_cha_files_urls)} cha files urls')

## Scraping and Parsing cha files content

In [None]:
%%time
Scrape_F.scrape_and_parse_cha_files(asd_cha_files_urls, asd_csv_file_path, asd=1)

# Children with No ASD

In [None]:
no_asd_urls_path = 'data/no_asd_urls.txt'
no_asd_rep_urls_path = 'data/no_asd_rep_urls'
no_asd_cha_files_path= 'data/no_asd_cha_files_urls' 

no_asd_csv_file_path = 'data/no_asd_speech.csv'

## Scrape Repositories URLs

In [None]:
%%time
no_asd_rep_urls = Scrape_F.scrape_repositories_urls(yml_file, no_asd_urls_path)
Scrape_F.list_to_txt(no_asd_rep_urls_path, no_asd_rep_urls)
print(f'we have scraped {len(no_asd_rep_urls)} repositories urls')

## Scraping cha-files urls

In [None]:
%%time
no_asd_cha_files_urls = Scrape_F.scrape_urls_files(yml_file, no_asd_rep_urls)
Scrape_F.list_to_txt(no_asd_cha_files_path, no_asd_cha_files_urls)
print(f'we have scraped {len(no_asd_cha_files_urls)} cha files urls')

## Scraping and Parsing cha files content

In [None]:
%%time
Scrape_F.scrape_and_parse_cha_files(no_asd_cha_files_urls, no_asd_csv_file_path, asd=0)

# Combine the datasets

In [None]:
df_no_asd=pd.read_csv(no_asd_csv_file_path)
df_asd= pd.read_csv(asd_csv_file_path)

In [None]:
print(df_no_asd.shape)
print(df_asd.shape)
print(df_no_asd.columns)
print(df_asd.columns)

In [None]:
df_no_asd.head()

In [None]:
df_asd.head()

In [None]:
# ignore_index=True gives the combined frame a unique 0..n-1 index
result = pd.concat([df_asd, df_no_asd], ignore_index=True)
result

In [None]:
# save the combined datasets
result.to_csv(autism_path, index=False, quoting=csv.QUOTE_NONNUMERIC)

# Filter the data

In [None]:
asd_filter = result['ASD'] == 1
print(f'ASD observations: {result[asd_filter].shape[0]} observations ({round(result[asd_filter].shape[0]/result.shape[0]*100,2)}%)')
print(f'No ASD observations: {result[~asd_filter].shape[0]} observations ({round(result[~asd_filter].shape[0]/result.shape[0]*100,2)}%)')

> We scraped **832_675** speech records, of which only **5%** come from children with ASD. A **resampling** technique can be used to reduce the large gap between the two categories.

> In addition, the children represented in this dataset are aged between **1** and **12** years. Before the age of **3**, children are still developing their linguistic skills and expanding their vocabulary, so it is too hard to distinguish children with ASD using speech alone. Psychologists nevertheless confirm that ASD can be detected before **3** through non-linguistic symptoms, such as stereotyped and repetitive motor movements (e.g., hand flapping or lining up items). Moreover, some linguistic habits such as babbling are very frequent before **3** years and are considered positive signs of typical development, whereas the same habit is considered a negative sign in older children. For those reasons, we filter the dataset so that children younger than **3** are excluded from the current version of the project.

> After the age of **6**, children have typically developed their linguistic skills, so it becomes too easy to distinguish children with ASD by their linguistic abilities. Hence, we set the upper age limit in the current version of the project to **6** years.

## Filter per age

In [None]:
# 1. Extract the absolute age (6;11.19 ---> 6)
result['abs_age'] = result['age'].apply(Scrape_F.abs_age)

# 2. Exclude children younger than 3 or older than 6 (keep ages 3 to 6)
age_filter = (result['abs_age']>2) & (result['abs_age']<7)
sample_age = result[age_filter]

# 3. save the sample in a csv file
sample_age.to_csv(sample_age_path, index=False, quoting=csv.QUOTE_NONNUMERIC)
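
> The implementation of `Scrape_F.abs_age` is not shown in this notebook. A minimal sketch of such a conversion, assuming CHAT ages of the form `years;months.days`, could look like the following. The name `abs_age_sketch` and the sample ages are hypothetical.

```python
def abs_age_sketch(age):
    """Return the whole-year part of a CHAT age string such as '6;11.19'.

    Hypothetical stand-in for Scrape_F.abs_age; returns None when the
    age is missing or does not contain the ';' separator.
    """
    if not isinstance(age, str) or ";" not in age:
        return None
    years = age.split(";", 1)[0].strip()
    return int(years) if years.isdigit() else None


# Made-up ages, filtered with the same bounds as the cell above (2 < age < 7)
ages = ["6;11.19", "2;06.", "3;00.05", None]
years = [abs_age_sketch(a) for a in ages]
kept = [a for a, y in zip(ages, years) if y is not None and 2 < y < 7]
print(years, kept)
```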

## Resampling

> To reduce the gap between the number of children with ASD and the number of those without the disorder, we used resampling. Two approaches are possible: over-sampling the minority class or under-sampling the majority class. As we have a huge number of children without ASD, we opted for under-sampling. Note that, in real-world settings, such datasets are typically imbalanced. Hence, we do not fully balance the dataset; instead, we reduce the gap between the two classes while keeping its imbalanced nature and retaining a sufficient amount of data to train the classification model afterwards.
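
> The number of majority-class records to keep follows from fixing the target class shares. Writing $N_{\mathrm{ASD}}$ and $N_{\mathrm{NO}}$ for the record counts with and without ASD (symbols introduced here for illustration), and $p_1$, $p_2$ for the target shares:

$$
p_1 = \frac{N_{\mathrm{ASD}}}{N_{\mathrm{ASD}} + N_{\mathrm{NO}}}, \qquad
p_2 = 1 - p_1 = \frac{N_{\mathrm{NO}}}{N_{\mathrm{ASD}} + N_{\mathrm{NO}}}
$$

$$
\frac{p_2}{p_1} = \frac{N_{\mathrm{NO}}}{N_{\mathrm{ASD}}}
\quad\Longrightarrow\quad
N_{\mathrm{NO}} = \frac{p_2}{p_1}\, N_{\mathrm{ASD}}
$$

> With the 30% / 70% split assumed in this notebook, the ratio is $0.7 / 0.3 \approx 2.33$. As a purely illustrative figure (not the dataset's actual count), 300 ASD records would call for keeping 700 records without ASD.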

In [None]:
# Resampling
'''
Assume that the real population has 30% of children with ASD.
We compute the number of records to keep from the majority class (children without ASD).
Let p1 be the percentage of children with ASD and p2 the percentage of children without ASD.
# NO_ASD records = (p2/p1) * # ASD records
'''
p1 = 0.3
p2 = 0.7
# Build the mask on sample_age itself so that it is aligned with its rows
asd_age_filter = sample_age['ASD'] == 1
ASD = sample_age[asd_age_filter].shape[0]
print('ASD:', ASD)
NO_ASD = int(round(p2/p1 * ASD, 0))
print('NO_ASD:', NO_ASD)
print('ALL:', NO_ASD + ASD)

sample = pd.concat([sample_age[~asd_age_filter].sample(n=NO_ASD, random_state=seed), sample_age[asd_age_filter]], axis=0)

# check (the mask is rebuilt on the resampled frame)
asd_sample_filter = sample['ASD'] == 1
print(f'ASD observations: {sample[asd_sample_filter].shape[0]} observations ({round(sample[asd_sample_filter].shape[0]/sample.shape[0]*100,2)}%)')
print(f'No ASD observations: {sample[~asd_sample_filter].shape[0]} observations ({round(sample[~asd_sample_filter].shape[0]/sample.shape[0]*100,2)}%)')

In [None]:
# save the sample
sample.to_csv(sample_path, index=False, quoting=csv.QUOTE_NONNUMERIC)

> 🗒 Here is the structure of the scraped data. We hid the names for privacy reasons.

![image](img/autism_ds.png)