# Early Autism Detection

## Exploratory Data Analysis

### Load The Autistic Spectrum Disorder Screening Data for Children Dataset

**Description:** This dataset contains information related to the screening of autistic spectrum disorder (ASD) in children. It includes various demographic and behavioral features that are used to identify potential ASD cases.

**Attributes:**

- **A1_Score:** Integer - The answer code for the first question in the AQ-10-Child questionnaire (0 or 1).

- **A2_Score:** Integer - The answer code for the second question in the AQ-10-Child questionnaire (0 or 1).

- **A3_Score:** Integer - The answer code for the third question in the AQ-10-Child questionnaire (0 or 1).

- **A4_Score:** Integer - The answer code for the fourth question in the AQ-10-Child questionnaire (0 or 1).

- **A5_Score:** Integer - The answer code for the fifth question in the AQ-10-Child questionnaire (0 or 1).

- **A6_Score:** Integer - The answer code for the sixth question in the AQ-10-Child questionnaire (0 or 1).

- **A7_Score:** Integer - The answer code for the seventh question in the AQ-10-Child questionnaire (0 or 1).

- **A8_Score:** Integer - The answer code for the eighth question in the AQ-10-Child questionnaire (0 or 1).

- **A9_Score:** Integer - The answer code for the ninth question in the AQ-10-Child questionnaire (0 or 1).

- **A10_Score:** Integer - The answer code for the tenth question in the AQ-10-Child questionnaire (0 or 1).

- **age:** Integer - Age of the individual in years (loaded as float64 because the column contains missing values).

- **gender:** Categorical - Gender of the individual (Male or Female, coded as m or f).

- **ethnicity:** Categorical - List of common ethnicities in text format.

- **jaundice:** Binary - Whether the individual was born with jaundice (yes or no).

- **autism:** Binary - Whether any immediate family member has a pervasive developmental disorder (PDD) (yes or no).

- **country_of_res:** Categorical - List of countries in text format.

- **used_app_before:** Binary - Whether the user has used a screening app before (yes or no).

- **result:** Integer - The final score obtained based on the scoring algorithm of the screening method used.

- **age_desc:** Categorical - Description of the age category.

- **relation:** Categorical - The person completing the test (Parent, self, caregiver, medical staff, clinician, etc.).

- **class:** Binary - The target variable indicating whether the individual is classified as having ASD (YES or NO).

**Source:** Thabtah, F. (2017). Autistic Spectrum Disorder Screening Data for Children [Dataset]. UCI Machine Learning Repository. Retrieved from https://doi.org/10.24432/C5659W.


In [1]:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from ucimlrepo import fetch_ucirepo 
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import squarify

In [2]:
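# Fetch dataset 419 (ASD screening data for children) from the UCI repository
autistic_spectrum_disorder_screening_data_for_children = fetch_ucirepo( id=419 )
X = autistic_spectrum_disorder_screening_data_for_children.data.features
y = autistic_spectrum_disorder_screening_data_for_children.data.targets
# Combine features and target into one frame; work on a copy to keep the original intact
df_original = pd.concat( [X, y], axis=1 )
df = df_original.copy()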

## Dataset Inspection

The dataset has **292** records and **21** columns: 20 features plus the **class** target.

In [3]:
feature_names = df.columns
print( feature_names )
print( df.shape )

Index(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 'A6_Score',
       'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'age', 'gender',
       'ethnicity', 'jaundice', 'autism', 'country_of_res', 'used_app_before',
       'result', 'age_desc', 'relation', 'class'],
      dtype='object')
(292, 21)


### Data Types, Head, and Tail

In [4]:
print( df.dtypes )

A1_Score             int64
A2_Score             int64
A3_Score             int64
A4_Score             int64
A5_Score             int64
A6_Score             int64
A7_Score             int64
A8_Score             int64
A9_Score             int64
A10_Score            int64
age                float64
gender              object
ethnicity           object
jaundice            object
autism              object
country_of_res      object
used_app_before     object
result               int64
age_desc            object
relation            object
class               object
dtype: object


In [5]:
df.head()

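| | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | ... | gender | ethnicity | jaundice | autism | country_of_res | used_app_before | result | age_desc | relation | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | m | Others | no | no | Jordan | no | 5 | '4-11 years' | Parent | NO |
| 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | m | 'Middle Eastern ' | no | no | Jordan | no | 5 | '4-11 years' | Parent | NO |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | m | | no | no | Jordan | yes | 5 | '4-11 years' | | NO |
| 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | f | | yes | no | Jordan | no | 4 | '4-11 years' | | NO |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | m | Others | yes | no | 'United States' | no | 10 | '4-11 years' | Parent | YES |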


In [6]:
df.tail()

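| | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | ... | gender | ethnicity | jaundice | autism | country_of_res | used_app_before | result | age_desc | relation | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 287 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | f | White-European | yes | yes | 'United Kingdom' | no | 10 | '4-11 years' | Parent | YES |
| 288 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | f | White-European | yes | yes | Australia | no | 4 | '4-11 years' | Parent | NO |
| 289 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | ... | m | Latino | no | no | Brazil | no | 7 | '4-11 years' | Parent | YES |
| 290 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ... | m | 'South Asian' | no | no | India | no | 9 | '4-11 years' | Parent | YES |
| 291 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | f | 'South Asian' | no | no | India | no | 3 | '4-11 years' | Parent | NO |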


### Missing Values

Three columns have missing values: **ethnicity** and **relation** each have **43** (about 14.7% of the records), and **age** has **4** (about 1.4%).

In [7]:
df.isna().sum()

A1_Score            0
A2_Score            0
A3_Score            0
A4_Score            0
A5_Score            0
A6_Score            0
A7_Score            0
A8_Score            0
A9_Score            0
A10_Score           0
age                 4
gender              0
ethnicity          43
jaundice            0
autism              0
country_of_res      0
used_app_before     0
result              0
age_desc            0
relation           43
class               0
dtype: int64
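
The counts above can be turned into a compact per-column summary. The sketch below is a minimal illustration on a made-up toy frame, not the notebook's data: `missing_summary` is a hypothetical helper, and the toy values are assumptions chosen only to exercise it. It also checks whether **ethnicity** and **relation** are missing on the same rows, which the equal counts (43 each) and rows 2 and 3 of the head output suggest but do not prove.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for df (illustrative values only).
toy = pd.DataFrame({
    "age": [6.0, np.nan, 7.0, 5.0],
    "ethnicity": ["Others", None, None, "Asian"],
    "relation": ["Parent", None, None, "Parent"],
    "result": [5, 5, 4, 10],
})

summary = missing_summary(toy)
print(summary)

# True only if the two columns are missing on exactly the same rows.
same_rows = bool((toy["ethnicity"].isna() == toy["relation"].isna()).all())
print("ethnicity and relation missing together:", same_rows)
```

Calling `missing_summary( df )` on the notebook's frame should list only **ethnicity**, **relation**, and **age**.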

### Duplicates

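The duplicated() check flags **2** rows (84 and 93). These two rows differ from each other; each one is an exact copy of an earlier record that the output below does not show. The dataset has no identifier column, so identical rows may come from different children who gave the same answers, and I'll keep them.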

In [8]:
duplicates = df.duplicated()
print( df[duplicates] )

    A1_Score  A2_Score  A3_Score  A4_Score  A5_Score  A6_Score  A7_Score  \
84         0         0         1         0         1         1         1   
93         0         0         1         1         1         1         1   

    A8_Score  A9_Score  A10_Score  ...  gender ethnicity jaundice autism  \
84         0         1          1  ...       m     Asian       no     no   
93         1         1          1  ...       m     Asian       no     no   

   country_of_res used_app_before result      age_desc relation class  
84          India              no      6  '4-11 years'   Parent    NO  
93          India              no      8  '4-11 years'   Parent   YES  

[2 rows x 21 columns]
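
By default, `duplicated()` uses `keep='first'`, so only the later copy of each repeated row is flagged and the earlier copy stays hidden. The sketch below shows, on a made-up toy frame (the values are assumptions for illustration), how `keep=False` lists every member of a duplicate group side by side.

```python
import pandas as pd

# Toy frame: row 2 repeats row 0; rows 1 and 3 differ in the class column.
toy = pd.DataFrame({
    "A1_Score": [0, 1, 0, 1],
    "age": [6, 7, 6, 7],
    "class": ["NO", "YES", "NO", "NO"],
})

# Default keep="first": only the later copy of a repeated row is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicate group, earlier copy included.
# Sorting by all columns places matching rows next to each other.
all_copies = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

print(later_copies)
print(all_copies)
```

On the notebook's frame, `df[df.duplicated( keep=False )]` would show rows 84 and 93 together with the earlier rows they repeat.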


### Outliers

The IQR rule flags **1 outlier**, in the **result** column (row 137). The **result** column is the sum of the **first ten** columns (**A1_Score** to **A10_Score**). In this record all ten answers are 0, so **result** is 0, which falls below the lower bound. I'll keep this record since it is valid.

In [9]:
def detect_outliers_iqr( df, column ):
    Q1 = df[column].quantile( 0.25 )
    Q3 = df[column].quantile( 0.75 )
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
 
    outliers = df[( df[column] < lower_bound ) | ( df[column] > upper_bound )]
    
    return outliers

numerical_features = df.select_dtypes( include=['number'] ).columns

for feature in numerical_features:
    outliers = detect_outliers_iqr( df, feature )
    print( f"Number of outliers in {feature}: {len( outliers )}" )
    if not outliers.empty:
        print( outliers )

Number of outliers in A1_Score: 0
Number of outliers in A2_Score: 0
Number of outliers in A3_Score: 0
Number of outliers in A4_Score: 0
Number of outliers in A5_Score: 0
Number of outliers in A6_Score: 0
Number of outliers in A7_Score: 0
Number of outliers in A8_Score: 0
Number of outliers in A9_Score: 0
Number of outliers in A10_Score: 0
Number of outliers in age: 0
Number of outliers in result: 1
     A1_Score  A2_Score  A3_Score  A4_Score  A5_Score  A6_Score  A7_Score  \
137         0         0         0         0         0         0         0   

     A8_Score  A9_Score  A10_Score  ...  gender ethnicity jaundice autism  \
137         0         0          0  ...       f  Hispanic       no     no   

      country_of_res used_app_before result      age_desc relation class  
137  'United States'              no      0  '4-11 years'   Parent    NO  

[1 rows x 21 columns]


In [10]:
for feature in numerical_features:
    outliers = detect_outliers_iqr( df, feature )
    print( f"Number of outliers in {feature}: {len( outliers )}" )
    if not outliers.empty:
        print( f"Outliers in {feature}:" )
        # Show the flagged column itself instead of a hard-coded 'result'
        print( outliers[[feature]] )

Number of outliers in A1_Score: 0
Number of outliers in A2_Score: 0
Number of outliers in A3_Score: 0
Number of outliers in A4_Score: 0
Number of outliers in A5_Score: 0
Number of outliers in A6_Score: 0
Number of outliers in A7_Score: 0
Number of outliers in A8_Score: 0
Number of outliers in A9_Score: 0
Number of outliers in A10_Score: 0
Number of outliers in age: 0
Number of outliers in result: 1
Outliers in result:
     result
137       0
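
Two claims in this section can be checked directly: that **result** equals the sum of the ten answer columns, and where the 1.5 * IQR fences fall. The sketch below is a minimal illustration on made-up toy rows. Only the column names come from the dataset; `iqr_bounds` is a hypothetical helper that mirrors the quantile logic of `detect_outliers_iqr`.

```python
import pandas as pd

# Column names follow the dataset; the rows are made-up toy values.
score_columns = [f"A{i}_Score" for i in range(1, 11)]
toy = pd.DataFrame(
    [[1] * 10, [0] * 10, [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]],
    columns=score_columns,
)
toy["result"] = [10, 0, 5]

# Rows where result differs from the sum of the ten answers.
recomputed = toy[score_columns].sum(axis=1)
mismatches = toy[recomputed != toy["result"]]
print("rows where result != sum of answers:", len(mismatches))


def iqr_bounds(series: pd.Series) -> tuple[float, float]:
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - 1.5 * iqr), float(q3 + 1.5 * iqr)


# Q1 = 2 and Q3 = 4 for this series, so IQR = 2.
lower, upper = iqr_bounds(pd.Series([1, 2, 3, 4, 5]))
print(lower, upper)
```

Running the same mismatch check on `df`, and printing `iqr_bounds( df['result'] )`, would confirm the sum relationship and show the fence that a score of 0 falls below.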


### Data Imbalance

The classes are nearly balanced: **151** **NO** records (51.71%) versus **141** **YES** records (48.29%), so class imbalance is not a concern.

In [11]:
class_distribution = df['class'].value_counts()

print( "Class distribution:" )
print( class_distribution )

class_percentage = df['class'].value_counts( normalize = True ) * 100

print( "\nClass percentage distribution:" )
for index, value in class_percentage.items():
    print( f"{index}: {value:.2f}%" )

Class distribution:
class
NO     151
YES    141
Name: count, dtype: int64

Class percentage distribution:
NO: 51.71%
YES: 48.29%
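
One way to quantify the balance is the ratio of majority to minority class size, together with the accuracy a classifier would reach by always predicting the majority class. The sketch below rebuilds the label column from the counts printed above (151 **NO**, 141 **YES**); the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Labels rebuilt from the class counts reported above.
labels = pd.Series(["NO"] * 151 + ["YES"] * 141, name="class")
counts = labels.value_counts()

# Majority-to-minority ratio: 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = counts.max() / counts.sum()

print(f"imbalance ratio: {imbalance_ratio:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

A ratio this close to 1 supports treating the classes as balanced. The baseline is a floor that any model trained later should clearly exceed.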
