# Text-based Model: Exploratory Data Analysis

-------   

In this last step before training, we take a closer look at the data to form a preliminary view of how to approach the training phase.

-----------

In [10]:
#Generic libs
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

#global params
dataset_path = 'data/autism_with_metadata.csv'

## Load Data

In [None]:
data = pd.read_csv(dataset_path)
data.head()

## Analysis

In [None]:
asd_filter = data['ASD']==1

### Gender Analysis

In [None]:
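# Count the children with ASD per sex (count() gives the non-null entries of each column)
df_sex = data[asd_filter].groupby(by=['sex']).count().reset_index()
# groupby sorts its keys, so the labels assume the female category sorts first in 'sex'
plt.bar(x=['Female', 'Male'], height=df_sex['name'])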
plt.xlabel("Gender")
plt.ylabel("Number of Children")
plt.title("Number of autistic boys vs autistic girls")
plt.show()

![image](img/gender.png)

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
Autism is significantly more common among boys than among girls. This skewed sex ratio is well recognized in autism statistics supplied by the <a href='https://www.cdc.gov/'>Centers for Disease Control and Prevention</a>.
</div> </pre> 
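
To put a number on the imbalance shown in the plot, one option is to compute the share of each sex among the children with ASD. The sketch below uses a small made-up frame (`toy`) that reuses the `ASD` and `sex` column names from above; the values and the `'F'`/`'M'` encoding are assumptions for illustration only, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the `data` frame loaded above.
toy = pd.DataFrame({
    "ASD": [1, 1, 1, 1, 0, 0],
    "sex": ["M", "M", "M", "F", "F", "M"],  # assumed encoding
})

# Keep only the children with ASD, then compute the share of each sex.
asd_sex_share = toy.loc[toy["ASD"] == 1, "sex"].value_counts(normalize=True)

# Male-to-female ratio among the children with ASD.
male_to_female = asd_sex_share["M"] / asd_sex_share["F"]

print(asd_sex_share)
print(f"Male-to-female ratio: {male_to_female:.1f}")
```

On the real data, the same two lines applied to `data` would give a ratio that can be compared with published figures.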

### Linguistic Abilities Analysis

In [None]:
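# Average every numeric feature per class; numeric_only=True skips text columns such as 'name',
# which would otherwise raise an error in recent pandas versions
df_eda = data.groupby(by=['ASD']).mean(numeric_only=True).reset_index()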

In [None]:
X = ['Speech','Meaningful Speech','Structured Speech', 'Different Words', 'Density']
cols = ['len_clean_annotated_speech', 'len_meaningful_speech', 'len_structured_speech', 'n_diff_words', 'density']
# groupby sorts by 'ASD', so row 0 holds the ASD == 0 means and row 1 the ASD == 1 means
ASD_count = df_eda[cols].iloc[1]
No_ASD_count = df_eda[cols].iloc[0]

X_axis = np.arange(len(X))

plt.figure(figsize=(10,5))
plt.bar(X_axis - 0.2, ASD_count, 0.4, label = 'ASD', color = 'r')
plt.bar(X_axis + 0.2, No_ASD_count, 0.4, label = 'NO ASD', color = 'g')

plt.xticks(X_axis, X)
plt.ylabel("Mean")
plt.title("Linguistic Abilities Analysis")
plt.legend()
plt.show()

![image](img/linguistic.png)

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
On average, children with ASD tend to use <b>shorter, less meaningful, less structured</b> sentences. They use fewer distinct words, which points to a more <b>limited vocabulary</b>. Finally, their speech is <b>less dense</b> than that of children without ASD, because they tend to use shorter words.
</div> </pre> 
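
The bar chart compares raw means of features that live on different scales, so the size of each gap is hard to judge by eye. One possible way to make the gaps comparable is a standardized effect size such as Cohen's d. This is a suggestion rather than part of the original analysis: the helper `cohens_d` and the toy values below are hypothetical, and only the column names `ASD` and `n_diff_words` come from the notebook.

```python
import numpy as np
import pandas as pd

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical toy values; in the notebook this would be the `data` frame.
toy = pd.DataFrame({
    "ASD": [1, 1, 1, 0, 0, 0],
    "n_diff_words": [10, 12, 14, 20, 22, 24],
})

d = cohens_d(
    toy.loc[toy["ASD"] == 1, "n_diff_words"],
    toy.loc[toy["ASD"] == 0, "n_diff_words"],
)
print(f"Cohen's d (ASD vs. no ASD): {d:.2f}")
```

A negative value means the ASD group has the lower mean. Applying the helper to each column in `cols` would give one unit-free number per feature.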

### Autism Signs Analysis

In [None]:
X = ['Babbling','Repetition','Best Guess', 'Unintelligible', 'Incompletion', 'Onomatopoeia', 'Hesitation', 'Misspelling', 'Disfluency']
cols = ['n_bab', 'n_rep', 'n_gue', 'n_uni', 'n_inq', 'n_ono', 'n_hes', 'n_mis', 'n_disf']
ASD_count = df_eda[cols].iloc[1]
No_ASD_count = df_eda[cols].iloc[0]

X_axis = np.arange(len(X))

plt.figure(figsize=(15,5))
plt.bar(X_axis - 0.2, ASD_count, 0.4, label = 'ASD', color = 'r')
plt.bar(X_axis + 0.2, No_ASD_count, 0.4, label = 'NO ASD', color = 'g')

plt.xticks(X_axis, X)
plt.ylabel("Mean")
plt.title("Autism Signs Analysis")
plt.legend()
plt.show()

![image](img/symptoms.png)

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
<li> <b>Babbling</b>, <b>Guessed</b> and <b>Unintelligible</b> words, <b>Incompletion</b>, <b>Onomatopoeia</b> and <b>Hesitation</b> are more frequent on average among children with ASD, which supports their use as signs of autism. 

<li> Surprisingly, <b>Disfluency</b> and <b>Misspelling</b> are more frequent among children without autism. A possible explanation is that these children have more developed linguistic skills and a wider vocabulary. Hence, they use more words and consequently make more errors.

<li> On the other hand, <b>Repetition</b>, known as <b>Echolalia</b> and one of the most recognized symptoms for distinguishing children with ASD, is not very informative in our dataset. This may be the result of random chance, and it deserves a closer look.
</div> </pre> 
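
One way to check whether a gap between the two group means could be due to chance, as suspected for <b>Repetition</b>, is a permutation test on the difference of means. The sketch below is only an illustration under stated assumptions: the function `permutation_test` and the toy counts are hypothetical, and in the notebook the two inputs would be the `n_rep` values of the ASD and non-ASD groups.

```python
import numpy as np

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffle the group labels by permuting the pooled values.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (count + 1) / (n_perm + 1)

# Hypothetical repetition counts per child, for illustration only.
asd_rep = [3, 4, 2, 5, 3, 4]
no_asd_rep = [3, 5, 2, 4, 4, 4]

observed, p_value = permutation_test(asd_rep, no_asd_rep, n_perm=2_000)
print(f"Observed difference in means: {observed:.3f}, p-value: {p_value:.3f}")
```

A large p-value means the observed gap is compatible with random label assignment. A small one would suggest that the feature does separate the groups.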

## Conclusion   
This first analysis indicates that the textual data carries **predictive value** that is useful for distinguishing children with autism.