Title: Autism Screening Adult Data Set
 -----------------------------------------

Number of Instances: 704

Attribute Characteristics: Integer

Number of Attributes: 21

Date Donated: 2017-12-24

Associated Tasks: Classification

Missing Values? Yes

Number of Web Hits: 84051

         Attribute                        Domain

       1. A1_Score                       {0,1}
       2. A2_Score                       {0,1}
       3. A3_Score                       {0,1}
       4. A4_Score                       {0,1}
       5. A5_Score                       {0,1}
       6. A6_Score                       {0,1}
       7. A7_Score                       {0,1}
       8. A8_Score                       {0,1}
       9. A9_Score                       {0,1}
      10. A10_Score                      {0,1}
      11. age                            numeric
      12. gender                         {f,m}
      13. ethnicity                      {White-European,Latino,Others,Black,Asian,'Middle Eastern ',Pasifika,'South asian',Hispanic,Turkish,others}
      14. jundice                        {no,yes}
      15. austim                         {no,yes}
      16. contry_of_res                  {'United States',Brazil,Spain,Egypt,'New Zealand',Bahamas,Burundi,Austria,Argentina,Jordan,Ireland,'United Arab Emirates',Afghanistan,Lebanon,'United Kingdom','South Africa',Italy,Pakistan,Bangladesh,Chile,France,China,Australia,Canada,'Saudi Arabia',Netherlands,Romania,Sweden,Tonga,Oman,India,Philippines,'Sri Lanka','Sierra Leone',Ethiopia,'Viet Nam',Iran,'Costa Rica',Germany,Mexico,Russia,Armenia,Iceland,Nicaragua,'Hong Kong',Japan,Ukraine,Kazakhstan,AmericanSamoa,Uruguay,Serbia,Portugal,Malaysia,Ecuador,Niger,Belgium,Bolivia,Aruba,Finland,Turkey,Nepal,Indonesia,Angola,Azerbaijan,Iraq,'Czech Republic',Cyprus}
      17. used_app_before                {no,yes}
      18. result                         numeric
      19. age_desc                       {'18 and more'}
      20. relation                       {Self,Parent,'Health care professional',Relative,Others}
      21. Class/ASD                      {NO,YES}

#### Features Description

    Feature : Description
    index : The participant's ID number
    AX_Score : Score (0 or 1) for item X of the Autism Spectrum Quotient (AQ) 10-item screening tool, AQ-10
    age : Age in years
    gender : Male or female
    ethnicity : Ethnicity in text form
    jaundice : Whether the participant was born with jaundice (column name in the data: jundice)
    autism : Whether anyone in the immediate family has been diagnosed with autism (column name in the data: austim)
    country_of_res : Country of residence in text form (column name in the data: contry_of_res)
    used_app_before : Whether the participant has used a screening app before
    result : Score from the AQ-10 screening tool
    age_desc : Age as a categorical value
    relation : Relation of the person who completed the test
    Class/ASD : Participant classification

## Importing Libraries

In [27]:
# For dataframe and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.io import arff

# Processing data
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

## Reading Data

In [28]:
# Load the data
# data = arff.loadarff('Autism-Adult-Data.arff')
dataset = pd.read_table('Autism-Adult-Data.arff', sep = ',')
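
The cell above reads the file as plain comma-separated text. Note that `read_table` treats the first line as a header by default, which may explain why 703 rows are loaded for a data set of 704 instances. As an alternative, here is a minimal sketch of loading a real ARFF file through `scipy.io.arff.loadarff`, which was the commented-out route. The sample content and the helper name `load_arff_as_frame` are assumptions made for illustration; they are not taken from the original notebook.

```python
import io

import pandas as pd
from scipy.io import arff

# Tiny in-memory ARFF sample (made-up rows, only three attributes)
SAMPLE_ARFF = """@relation sample
@attribute A1_Score {0,1}
@attribute age numeric
@attribute gender {f,m}
@data
1,26,f
0,31,m
"""


def load_arff_as_frame(source):
    """Load an ARFF file (path or file-like object) into a DataFrame.

    Hypothetical helper, not part of the original notebook.
    """
    records, _meta = arff.loadarff(source)
    frame = pd.DataFrame(records)
    # Nominal attributes come back as bytes, so decode them to str
    for col in frame.columns:
        if frame[col].dtype == object:
            frame[col] = frame[col].str.decode("utf-8")
    return frame


sample_df = load_arff_as_frame(io.StringIO(SAMPLE_ARFF))
print(sample_df)
```

With the real file, the call would presumably be `load_arff_as_frame('Autism-Adult-Data.arff')`, provided the file still contains its `@attribute` header lines.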

In [29]:
# df = pd.DataFrame(data[0])
df = dataset.copy()

In [30]:
# Rename columns
# Note: the score column is named 'resultnumeric' (not 'result') in all outputs below
df.columns = ['A1_Score','A2_Score','A3_Score','A4_Score','A5_Score','A6_Score','A7_Score','A8_Score','A9_Score','A10_Score','age','gender','ethnicity','jundice','austim','contry_of_res','used_app_before','resultnumeric','age_desc','relation','Class/ASD']

In [31]:
df.head()

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,resultnumeric,age_desc,relation,Class/ASD
0,1,1,0,1,0,0,0,1,0,1,...,m,Latino,no,yes,Brazil,no,5,'18 and more',Self,NO
1,1,1,0,1,1,0,1,1,1,1,...,m,Latino,yes,yes,Spain,no,8,'18 and more',Parent,YES
2,1,1,0,1,0,0,1,1,0,1,...,f,White-European,no,yes,'United States',no,6,'18 and more',Self,NO
3,1,0,0,0,0,0,0,1,0,0,...,f,?,no,no,Egypt,no,2,'18 and more',?,NO
4,1,1,1,1,1,0,1,1,1,1,...,m,Others,yes,no,'United States',no,9,'18 and more',Self,YES


In [32]:
df.index #Describe index

RangeIndex(start=0, stop=703, step=1)

In [33]:
df.shape

(703, 21)

In [34]:
df.count() #Number of non-NA values

A1_Score           703
A2_Score           703
A3_Score           703
A4_Score           703
A5_Score           703
A6_Score           703
A7_Score           703
A8_Score           703
A9_Score           703
A10_Score          703
age                703
gender             703
ethnicity          703
jundice            703
austim             703
contry_of_res      703
used_app_before    703
resultnumeric      703
age_desc           703
relation           703
Class/ASD          703
dtype: int64

## Feature Engineering

In [35]:
df.info() #Info on DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 703 entries, 0 to 702
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   A1_Score         703 non-null    int64 
 1   A2_Score         703 non-null    int64 
 2   A3_Score         703 non-null    int64 
 3   A4_Score         703 non-null    int64 
 4   A5_Score         703 non-null    int64 
 5   A6_Score         703 non-null    int64 
 6   A7_Score         703 non-null    int64 
 7   A8_Score         703 non-null    int64 
 8   A9_Score         703 non-null    int64 
 9   A10_Score        703 non-null    int64 
 10  age              703 non-null    object
 11  gender           703 non-null    object
 12  ethnicity        703 non-null    object
 13  jundice          703 non-null    object
 14  austim           703 non-null    object
 15  contry_of_res    703 non-null    object
 16  used_app_before  703 non-null    object
 17  resultnumeric    703 non-null    in

Some columns have dtype object, and several of them hold the strings yes or no. We replace these with binary values (0, 1). The gender column is encoded the same way (f = 1, m = 0).

In [36]:
# Encode binary string values as integers (applied to the whole DataFrame)
df = df.replace({"yes": 1, "no": 0, "YES": 1, "NO": 0, "f": 1, "m": 0})
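
Replacing values across the whole DataFrame works here because strings such as "f" or "no" only occur in the binary columns. A more defensive variant, sketched below on made-up toy data, applies one explicit mapping per column so that other columns can never be touched by accident. The names `toy`, `column_maps`, and `encoded` are illustrative assumptions.

```python
import pandas as pd

# Toy frame mimicking a few columns of the data set (values are made up)
toy = pd.DataFrame({
    "gender": ["f", "m", "f"],
    "jundice": ["no", "yes", "no"],
    "Class/ASD": ["NO", "YES", "NO"],
    "ethnicity": ["Latino", "Others", "?"],
})

# One explicit mapping per binary column
column_maps = {
    "gender": {"f": 1, "m": 0},
    "jundice": {"yes": 1, "no": 0},
    "Class/ASD": {"YES": 1, "NO": 0},
}

encoded = toy.copy()
for col, mapping in column_maps.items():
    # map() only looks at this one column; unmapped columns stay unchanged
    encoded[col] = encoded[col].map(mapping).astype("int64")

print(encoded.dtypes)
```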

In [37]:
df.info() #Info on DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 703 entries, 0 to 702
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   A1_Score         703 non-null    int64 
 1   A2_Score         703 non-null    int64 
 2   A3_Score         703 non-null    int64 
 3   A4_Score         703 non-null    int64 
 4   A5_Score         703 non-null    int64 
 5   A6_Score         703 non-null    int64 
 6   A7_Score         703 non-null    int64 
 7   A8_Score         703 non-null    int64 
 8   A9_Score         703 non-null    int64 
 9   A10_Score        703 non-null    int64 
 10  age              703 non-null    object
 11  gender           703 non-null    int64 
 12  ethnicity        703 non-null    object
 13  jundice          703 non-null    int64 
 14  austim           703 non-null    int64 
 15  contry_of_res    703 non-null    object
 16  used_app_before  703 non-null    int64 
 17  resultnumeric    703 non-null    in

In [38]:
MissingValues = {col:df[df[col] == "?"].shape[0] for col in df.columns}
MissingValues

{'A1_Score': 0,
 'A2_Score': 0,
 'A3_Score': 0,
 'A4_Score': 0,
 'A5_Score': 0,
 'A6_Score': 0,
 'A7_Score': 0,
 'A8_Score': 0,
 'A9_Score': 0,
 'A10_Score': 0,
 'age': 2,
 'gender': 0,
 'ethnicity': 95,
 'jundice': 0,
 'austim': 0,
 'contry_of_res': 0,
 'used_app_before': 0,
 'resultnumeric': 0,
 'age_desc': 0,
 'relation': 95,
 'Class/ASD': 0}
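
The same count can also be obtained by first converting the '?' placeholder into a real missing value and then using the pandas missing-value tools. The following is a small sketch on made-up data, not the notebook's own cell; `toy`, `marked`, and `missing_counts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows using '?' as the missing-value placeholder
toy = pd.DataFrame({
    "age": ["26", "?", "31"],
    "ethnicity": ["Latino", "?", "?"],
    "relation": ["Self", "?", "Parent"],
})

# Treat '?' as missing, then count missing values per column
marked = toy.replace("?", np.nan)
missing_counts = marked.isna().sum()
print(missing_counts)
```

Once the placeholder is a real NaN, methods such as `dropna` and `fillna` apply to it directly.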

### Drop rows with '?' values of age

In [39]:
# Mark '?' in the age column as missing (np.nan; the np.NaN alias was removed in NumPy 2.0)
df.loc[df['age'] == '?', 'age'] = np.nan

In [40]:
df.dropna(inplace= True)

In [41]:
df['age'] = df['age'].str.replace(',','').astype(int)
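
The cells above drop the two rows whose age is unknown. If filling those gaps with the mean were preferred instead, one possible approach is sketched below on made-up values; `ages`, `numeric_age`, and `imputed_age` are illustrative names, and this is an alternative rather than what the notebook does.

```python
import pandas as pd

ages = pd.Series(["26", "?", "31", "24"], name="age")  # made-up values

# Convert to numbers; anything unparseable (such as '?') becomes NaN
numeric_age = pd.to_numeric(ages, errors="coerce")

# Fill the gaps with the mean of the observed values
imputed_age = numeric_age.fillna(numeric_age.mean())
print(imputed_age.tolist())
```

Keep in mind that the mean is sensitive to extreme values, so a median may be a safer fill value when the column contains outliers.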

### Replace '?' values of ethnicity by 'Others' and 'others' by 'Others'

In [42]:
df['ethnicity'].unique()

array(['Latino', 'White-European', '?', 'Others', 'Black', 'Asian',
       "'Middle Eastern '", 'Pasifika', "'South Asian'", 'Hispanic',
       'Turkish', 'others'], dtype=object)

In [43]:
df['ethnicity'] = df['ethnicity'].replace('?', 'others')

In [44]:
df['ethnicity'] = df['ethnicity'].replace('others', 'Others')

In [45]:
df['ethnicity'].unique()

array(['Latino', 'White-European', 'Others', 'Black', 'Asian',
       "'Middle Eastern '", 'Pasifika', "'South Asian'", 'Hispanic',
       'Turkish'], dtype=object)

### Replace '?' values of relation by the mode of relation

In [46]:
df['relation'].unique()

array(['Self', 'Parent', '?', "'Health care professional'", 'Relative',
       'Others'], dtype=object)

In [47]:
df['relation'] = df['relation'].replace('?', df['relation'].mode()[0])

In [48]:
df['relation'].unique()

array(['Self', 'Parent', "'Health care professional'", 'Relative',
       'Others'], dtype=object)
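
Mode imputation as done above relies on '?' not being the most frequent value itself. A slightly more robust sketch, on made-up data where '?' happens to be the most common entry, computes the mode over the known values only. The names `known`, `most_common`, and `filled` are assumptions for illustration.

```python
import pandas as pd

# Made-up values; '?' is deliberately the most frequent entry here
relation = pd.Series(["Self", "?", "Parent", "Self", "?", "?"], name="relation")

# Compute the mode over known values only, so '?' can never be picked
known = relation[relation != "?"]
most_common = known.mode()[0]

filled = relation.replace("?", most_common)
print(filled.tolist())
```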

In [49]:
df.isnull().sum() #Number of NA values

A1_Score           0
A2_Score           0
A3_Score           0
A4_Score           0
A5_Score           0
A6_Score           0
A7_Score           0
A8_Score           0
A9_Score           0
A10_Score          0
age                0
gender             0
ethnicity          0
jundice            0
austim             0
contry_of_res      0
used_app_before    0
resultnumeric      0
age_desc           0
relation           0
Class/ASD          0
dtype: int64

In [50]:
df.info() #Info on DataFrame

<class 'pandas.core.frame.DataFrame'>
Int64Index: 701 entries, 0 to 702
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   A1_Score         701 non-null    int64 
 1   A2_Score         701 non-null    int64 
 2   A3_Score         701 non-null    int64 
 3   A4_Score         701 non-null    int64 
 4   A5_Score         701 non-null    int64 
 5   A6_Score         701 non-null    int64 
 6   A7_Score         701 non-null    int64 
 7   A8_Score         701 non-null    int64 
 8   A9_Score         701 non-null    int64 
 9   A10_Score        701 non-null    int64 
 10  age              701 non-null    int64 
 11  gender           701 non-null    int64 
 12  ethnicity        701 non-null    object
 13  jundice          701 non-null    int64 
 14  austim           701 non-null    int64 
 15  contry_of_res    701 non-null    object
 16  used_app_before  701 non-null    int64 
 17  resultnumeric    701 non-null    in

    # : Index of the column in the DataFrame
    Column : Name of the feature (column header) in the DataFrame
    Non-Null Count : Number of non-null values in each column
    Dtype : Data type stored in each column

## Summary

In [51]:
df.describe() #Statistical summary of DataFrame

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,age,gender,jundice,austim,used_app_before,resultnumeric,Class/ASD
count,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0,701.0
mean,0.723252,0.452211,0.457917,0.496434,0.499287,0.285307,0.416548,0.650499,0.32525,0.574893,29.703281,0.477889,0.098431,0.129815,0.017118,4.881598,0.269615
std,0.44771,0.498066,0.498582,0.500344,0.500357,0.451883,0.493339,0.477153,0.468803,0.494712,16.51866,0.499868,0.298109,0.336339,0.129805,2.499478,0.444077
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0,0.0,3.0,0.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,0.0,4.0,0.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35.0,1.0,0.0,0.0,0.0,7.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,383.0,1.0,1.0,1.0,1.0,10.0,1.0


    count: number of non-null examples for the selected feature
    mean: arithmetic mean of the selected feature
    std: standard deviation of the selected feature
    min: minimum value of the selected feature
    25%: first quartile of the selected feature
    50%: second quartile (median) of the selected feature
    75%: third quartile of the selected feature
    max: maximum value of the selected feature
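
To make these statistics concrete, the sketch below recomputes the quartiles on a small made-up series and shows how the summary can expose implausible values, such as the maximum age of 383 reported above. The threshold `MAX_PLAUSIBLE_AGE` is a hypothetical choice for illustration, not something defined by the data set.

```python
import pandas as pd

# Made-up sample; 383 mirrors the maximum age seen in the summary
ages = pd.Series([17, 21, 27, 35, 383], name="age")

summary = ages.describe()
q1, q2, q3 = ages.quantile([0.25, 0.5, 0.75])
print(summary)

# Hypothetical sanity threshold for an adult data set
MAX_PLAUSIBLE_AGE = 120
suspicious = ages[ages > MAX_PLAUSIBLE_AGE]
print(suspicious.tolist())
```

Whether such a row should be corrected, dropped, or kept is a modelling decision; the sketch only flags it.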

In [52]:
df.head()

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,resultnumeric,age_desc,relation,Class/ASD
0,1,1,0,1,0,0,0,1,0,1,...,0,Latino,0,1,Brazil,0,5,'18 and more',Self,0
1,1,1,0,1,1,0,1,1,1,1,...,0,Latino,1,1,Spain,0,8,'18 and more',Parent,1
2,1,1,0,1,0,0,1,1,0,1,...,1,White-European,0,1,'United States',0,6,'18 and more',Self,0
3,1,0,0,0,0,0,0,1,0,0,...,1,Others,0,0,Egypt,0,2,'18 and more',Self,0
4,1,1,1,1,1,0,1,1,1,1,...,0,Others,1,0,'United States',0,9,'18 and more',Self,1


## Visualization