# The ABIDE dataset

Autism Spectrum Disorder (ASD) is currently a very active research topic.
The fundamental issue is the absence of a gold-standard methodology for diagnosis: evaluation still relies on clinical interviews and behavioral assessments.

The ABIDE (Autism Brain Imaging Data Exchange) dataset was created to improve our knowledge of ASD and to uncover the rules behind it.

The ABIDE dataset is a collaborative, multi-site dataset containing several kinds of information: fMRI, sMRI, phenotypic data, and diagnoses.

This project focuses on the phenotypic characteristics, in order to address the fundamental issue mentioned above.

The ABIDE Dataset has been validated in several papers.
Our reference and starting point is available here:
https://pubmed.ncbi.nlm.nih.gov/30393630/


In [None]:
import os  # to interact with the file system
import numpy as np  # numerical computing and statistics
import pandas as pd  # data preprocessing and data analysis
from matplotlib import pyplot as plt  # visualization
import seaborn as sns  # visualization
import missingno as msno  # visualization of missing values
from sklearn.preprocessing import RobustScaler  # scikit-learn scaler
import OurFunctions as of  # separate collection of project helper functions
import Optimization as opt  # project module with the optimization routines

#seed for random processes
seed = 42
np.random.seed(seed)

## Load the dataset

In [None]:
ASD_phenotypic_original = pd.read_csv(os.path.join('DataSets','Phenotypic Datasets','ASD_phenotypic.csv'))

# DATA EXPLORATION

Visualization of the overall dataset

In [None]:
ASD_phenotypic_original

The dataset contains 1112 subjects and has 74 features.

At first glance, the dataset contains both categorical and numerical features, as well as missing values. It also includes the features DX_GROUP and DSM_IV_TR, which the ABIDE data legend describes as diagnostic, so we will later remove them from the inputs used to build the predictive model.
We can also drop the feature EYE_STATUS_AT_SCAN: it records whether the subject's eyes were open during the fMRI scan, so it is not relevant for our purposes.

DX_GROUP and DSM_IV_TR are the diagnostic labels.
DX_GROUP indicates whether ASD was diagnosed (yes/no).
DSM_IV_TR specifies the type of autism, if any.
However, in this investigation we are only interested in the presence or absence of the disorder, so we will not use the information in DSM_IV_TR.

To assess how balanced the dataset is, we check the target feature 'DX_GROUP':

| DX_GROUP | Meaning |
|---|---|
| 1 | Autism detected |
| 2 | Controls |

In [None]:
of.evaluate_balancing(ASD_phenotypic_original)

DX_GROUP is balanced enough.
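
`of.evaluate_balancing` comes from the project's own `OurFunctions` module, which is not shown here. As a rough sketch of an equivalent check, assuming only that `DX_GROUP` holds the codes 1 and 2, the class proportions can be read from `value_counts`. The helper name `class_balance` and the toy data are hypothetical.

```python
import pandas as pd

def class_balance(df: pd.DataFrame, target: str = "DX_GROUP") -> pd.Series:
    """Return the percentage of rows in each class of `target` (hypothetical helper)."""
    return df[target].value_counts(normalize=True).sort_index() * 100

# Toy data standing in for the real table (values are made up)
toy = pd.DataFrame({"DX_GROUP": [1, 1, 2, 2, 2]})
balance = class_balance(toy)
print(balance)
```

A split such as 40/60 would satisfy the 70/30 limit used later in the cleaning step.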

Now, to get a better view of the information in the dataset, we display the feature names together with their types and their number of non-null values.

In [None]:
ASD_phenotypic_original.info()

In [None]:
ASD_phenotypic_original.describe(include = 'object').T

As we can see, there is a large number of missing values.

To get a better view of how they are distributed, we count the values recorded as None or numpy.nan.

In [None]:
# Count of the missing values

ASD_phenotypic_original, percent_missing = of.count_missing_value(ASD_phenotypic_original)

# We implemented a function "select_columns" that identifies which columns are numerical
# and which ones are categorical (it also converts object columns to categorical in the dataset)
numeric_columns, categorical_columns, ASD_phenotypic_original = of.select_columns(ASD_phenotypic_original)

# We plot the distribution of missing values, with the specification of numeric and categorical columns
of.plot_missing_values(percent_missing, numeric_columns, legend=True)



For the majority of the features the amount of missing values is not negligible, so the information they carry is not sufficient to build a reliable classification or clustering algorithm on them. The same applies to the subjects. We need to work on a dataset with at most 10% missing values per feature, so the data must be cleaned.
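
The helper `of.count_missing_value` is not shown in this notebook. As a minimal sketch of the same idea, the percentage of missing values per feature can be computed with plain pandas and compared against the 10% limit. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the phenotypic table (values are made up)
toy = pd.DataFrame({
    "FIQ": [100.0, np.nan, 95.0, 110.0],
    "ADOS_TOTAL": [np.nan, np.nan, np.nan, 12.0],
    "SEX": [1, 2, 1, 1],
})

# Percentage of missing values per feature
percent_missing = toy.isna().mean() * 100

# Features that respect the 10% limit
max_missing_percentage = 10
kept = percent_missing[percent_missing <= max_missing_percentage].index.tolist()
print(percent_missing)
print(kept)
```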

Now let's look at the general statistics of the numerical attributes.

In [None]:
ASD_phenotypic_original.describe()

We notice "-9999" as the minimum value of several features. This value is commonly used to denote missing or out-of-range data, so it is better treated as NaN.

Now that we have an overall view of the dataset, let's start cleaning it.

# DATA CLEANING

First of all, we decide to set aside the feature DX_GROUP, which gives the diagnosis of the subjects, as our label. As noted above, DSM_IV_TR is irrelevant for our purpose, and so is EYE_STATUS_AT_SCAN. These two features are dropped.

In [None]:


ASD_phenotypic = ASD_phenotypic_original.drop(columns=['DSM_IV_TR','EYE_STATUS_AT_SCAN'])



We also decide to drop SUB_ID, as it only stores the subject identifier. Before doing so, we check that there are no duplicated subjects; if none are found, we simply drop the column.

In [None]:
# Check if there are duplicate values in the 'SUB_ID' column
duplicate_ids = ASD_phenotypic['SUB_ID'].duplicated(keep=False)

# Get the unique duplicate IDs
unique_duplicate_ids = ASD_phenotypic.loc[duplicate_ids, 'SUB_ID'].unique()

#Drop column if there aren't duplicates
if len(unique_duplicate_ids) == 0:
    ASD_phenotypic = ASD_phenotypic.drop(columns=['SUB_ID'])
    print("SUB_ID has been dropped")
else:
    print("There are replicated values:" + str(unique_duplicate_ids))
    

To get a true count of the missing values per feature, we change the -9999 values in the data to np.nan (we can do this because we know from the data sheets that -9999 is out of range for every feature).

In [None]:
for column in ASD_phenotypic:
    
    # Replace -9999 and "-9999" with NaN
    ASD_phenotypic[column] = ASD_phenotypic[column].replace(['-9999', -9999], np.nan)
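
The column-by-column loop can also be written as a single `DataFrame.replace` call. The sketch below shows this on a toy frame with made-up values; it is meant as an illustration of the idea, not as the exact code used in the project.

```python
import numpy as np
import pandas as pd

# Toy frame with the sentinel stored both as a number and as a string
toy = pd.DataFrame({
    "FIQ": [103, -9999, 98],
    "HANDEDNESS_CATEGORY": ["R", "-9999", "L"],
})

# Replace both forms of the sentinel in one call
clean = toy.replace([-9999, "-9999"], np.nan)
print(clean.isna().sum())
```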
    

We also notice two features, ADI_R_RSRCH_RELIABLE and ADOS_RSRCH_RELIABLE, which concern the person who administered the tests and indicate whether he or she was a trained professional.
We use this information to assess the reliability of the test scores. To avoid including unreliable information, we remove the subjects that have a 0 (not reliable) in both features; if either value is 1 or missing, we keep the subject.

In [None]:
# Create a mask for rows to keep: a subject is dropped only if both flags are 0
# (NaN != 0 evaluates to True, so subjects with missing flags are kept)
keep_mask = (ASD_phenotypic['ADI_R_RSRCH_RELIABLE'] != 0) | (ASD_phenotypic['ADOS_RSRCH_RELIABLE'] != 0)

# Calculate the number of subjects to delete
deleted_subjects = len(ASD_phenotypic) - keep_mask.sum()

# Apply the mask to the DataFrame
ASD_phenotypic = ASD_phenotypic[keep_mask]

# Print the number of subjects deleted
print("Number of subjects deleted:", deleted_subjects)

# The reliability flags are no longer needed
ASD_phenotypic = ASD_phenotypic.drop(columns=['ADI_R_RSRCH_RELIABLE','ADOS_RSRCH_RELIABLE'])
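
To make the behavior of the reliability mask explicit, here is a small worked example on made-up flags. It illustrates that only the subject with 0 in both columns is removed, because a comparison of NaN with 0 using `!=` is True.

```python
import numpy as np
import pandas as pd

# Made-up reliability flags covering the possible combinations
toy = pd.DataFrame({
    "ADI_R_RSRCH_RELIABLE": [0, 0, 1, np.nan, 0],
    "ADOS_RSRCH_RELIABLE":  [0, 1, 0, np.nan, np.nan],
})

# Keep a subject unless both flags are exactly 0
keep = (toy["ADI_R_RSRCH_RELIABLE"] != 0) | (toy["ADOS_RSRCH_RELIABLE"] != 0)
print(keep.tolist())  # [False, True, True, True, True]
```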

In [None]:
ASD_phenotypic

To complete the cleaning of our starting dataset, we:

- display the distribution of missing values
- reduce the dataset to a cleaned version that satisfies three conditions:
    
    - at most 10% of NaN per feature
    - at least 1/4 of the subjects retained
    - a balanced dataset (at most 70/30%)

In [None]:
# Visualization of the missing values
msno.matrix(ASD_phenotypic)

Following the pipeline of the reference paper, we create an algorithm that enforces the boundary conditions listed above. Moreover, we force the key features to be retained.

In [None]:
# Key features to maintain
key_features = ['FIQ', 'VIQ', 'PIQ', 'ADI_R_VERBAL_TOTAL_BV', 'ADOS_TOTAL']


# Cleaning function applied to the dataset
ASD_phenotypic_cleaned = opt.remove_high_missing(ASD_phenotypic, key_features, balance_column='DX_GROUP', min_subjects=200, max_missing_percentage=10)

# Display the cleaned DataFrame
ASD_phenotypic_cleaned
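
The implementation of `opt.remove_high_missing` lives in the project's `Optimization` module and is not shown here. The sketch below is only one plausible reading of the constraints: it drops sparse features while protecting the key ones, then checks the size and balance conditions. The helper names `drop_sparse_features` and `check_constraints` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_features(df, key_features, max_missing_percentage=10):
    """Drop features above the missing-value limit, always keeping the key features."""
    percent_missing = df.isna().mean() * 100
    too_sparse = percent_missing[percent_missing > max_missing_percentage].index
    to_drop = [c for c in too_sparse if c not in key_features]
    return df.drop(columns=to_drop)

def check_constraints(df, balance_column, min_subjects, max_majority=0.70):
    """True when enough subjects remain and no class exceeds `max_majority`."""
    shares = df[balance_column].value_counts(normalize=True)
    return len(df) >= min_subjects and shares.max() <= max_majority

# Toy data (values are made up)
toy = pd.DataFrame({
    "DX_GROUP": [1, 2, 1, 2],
    "FIQ": [100.0, np.nan, 95.0, 110.0],    # key feature, kept despite 25% missing
    "BMI": [np.nan, np.nan, 21.0, np.nan],  # 75% missing, dropped
    "AGE_AT_SCAN": [12.0, 15.0, 9.0, 30.0],
})
reduced = drop_sparse_features(toy, key_features=["FIQ"])
ok = check_constraints(reduced, "DX_GROUP", min_subjects=4)
print(reduced.columns.tolist(), ok)
```

Because key features are protected, they can still exceed the limit after this step, which is why a further pass on the subjects may be needed.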

Checking Missing Values for features and subjects

In [None]:
ASD_phenotypic_cleaned.isna().sum()


In [None]:
# Count of the missing values
ASD_phenotypic_cleaned, percent_missing = of.count_missing_value(ASD_phenotypic_cleaned)
numeric_columns, categorical_columns, ASD_phenotypic_cleaned = of.select_columns(ASD_phenotypic_cleaned)

of.plot_missing_values(percent_missing, numeric_columns, legend=True)



In [None]:
# Count the missing values per subject instead of per feature
nan_values_per_subject = ASD_phenotypic_cleaned.T.isna().sum()

# Sort the subjects by number of missing values
subjects_with_nan_sorted = nan_values_per_subject.sort_values(ascending=False)


of.plot_missing_values(subjects_with_nan_sorted, nan_values_per_subject, legend=False)
plt.ylabel('Subjects')
plt.yticks([])
plt.show()

Checking the balancing

In [None]:
of.evaluate_balancing(ASD_phenotypic_cleaned)

The goal has been only partially achieved. Since we want to keep ADI_R_VERBAL_TOTAL_BV and ADOS_TOTAL, we inspect how their missing values are distributed across the diagnostic groups.

In [None]:
# Features to check
features_to_check = ['ADI_R_VERBAL_TOTAL_BV', 'ADOS_TOTAL']

# Count the subjects with missing values in each DX_GROUP
missing_counts = ASD_phenotypic_cleaned[features_to_check + ['DX_GROUP']].isna().groupby(ASD_phenotypic_cleaned['DX_GROUP']).sum()

# Evaluate total number of subjects for each DX_GROUP
total_counts = ASD_phenotypic_cleaned['DX_GROUP'].value_counts()

# Print results
missing_counts_df = pd.DataFrame({
    'Missing_ADI_R_VERBAL_TOTAL_BV': missing_counts['ADI_R_VERBAL_TOTAL_BV'],
    'Missing_ADOS_TOTAL': missing_counts['ADOS_TOTAL'],
    'Total_Subjects': total_counts
})

print(missing_counts_df)


In [None]:
# Select autistic subjects (DX_GROUP = 1) with missing values in the specified features
missing_autistic = ASD_phenotypic_cleaned[(ASD_phenotypic_cleaned['DX_GROUP'] == 1) & (ASD_phenotypic_cleaned[features_to_check].isna().any(axis=1))]

# Select non-autistic subjects (DX_GROUP = 2) with missing values in the specified features
missing_non_autistic = ASD_phenotypic_cleaned[(ASD_phenotypic_cleaned['DX_GROUP'] == 2) & (ASD_phenotypic_cleaned[features_to_check].isna().any(axis=1))]
# Get the indices of the autistic subjects with missing values
indices_to_drop = missing_autistic.index

# Remove those autistic subjects from the DataFrame ASD_phenotypic_cleaned
ASD_phenotypic_cleaned = ASD_phenotypic_cleaned.drop(index=indices_to_drop)

In [None]:
of.evaluate_balancing(ASD_phenotypic_cleaned)

In [None]:

# Define the 25% threshold for the missing values
desired_missing_percentage = 25


cleaned_df = opt.optimization_rules(ASD_phenotypic_cleaned,features_to_check,desired_missing_percentage)



In [None]:
# Count of the missing values

cleaned_df,percent_missing = of.count_missing_value(cleaned_df)

numeric_columns, categorical_columns, cleaned_df = of.select_columns(cleaned_df)

of.plot_missing_values(percent_missing, numeric_columns, legend=True)



In [None]:
of.evaluate_balancing(cleaned_df)

In [None]:

# Store the diagnostic label DX_GROUP in a new dataset called ASD_clinical
ASD_clinical = cleaned_df[['DX_GROUP']]
# Drop the DX_GROUP column from the feature set
ASD_phenotypic = cleaned_df.drop(columns=['DX_GROUP'])



In [None]:
ASD_phenotypic.head().T

# Data Distribution

We want to understand how our features are distributed, in order to choose the most suitable normalization and to determine whether outlier detection will be needed in the classification phase.

In [None]:
# Plot distribution of features
of.plot_distributions(ASD_phenotypic)

Discussing the plots:
We notice possible outliers in several features. We therefore quantify their fraction in the entire dataset, so that it can be taken into account in our later analysis (outlier detection in classification).

In [None]:
thresholds = {
    'AGE_AT_SCAN': [{'threshold': 30, 'rule': 'greater'}],
    'FIQ': [
        {'threshold': 145, 'rule': 'greater'},
        {'threshold': 70, 'rule': 'less'}
    ],
    'VIQ': [
        {'threshold': 145, 'rule': 'greater'},
        {'threshold': 70, 'rule': 'less'}
    ],
    'PIQ': [
        {'threshold': 140, 'rule': 'greater'},
        {'threshold': 70, 'rule': 'less'}
    ],
    'ADI_R_VERBAL_TOTAL_BV': [{'threshold': 6, 'rule': 'less'}]
}


In [None]:
# Count number of outliers per feature
outlier_counts = {}
total_samples = len(ASD_phenotypic)

for feature, rules in thresholds.items():
    outlier_count = 0
    for rule in rules:
        threshold = rule['threshold']
        comparison = rule['rule']
        if comparison == 'greater':
            outlier_count += (ASD_phenotypic[feature] > threshold).sum()
        elif comparison == 'less':
            outlier_count += (ASD_phenotypic[feature] < threshold).sum()
    outlier_counts[feature] = outlier_count

# % outliers per feature
outlier_percentages = {feature: (count / total_samples) * 100 for feature, count in outlier_counts.items()}

# Print results
for feature, percentage in outlier_percentages.items():
    print(f'Feature: {feature}, Outlier Percentage: {percentage:.2f}%')


In [None]:
outlier_counts

## CORRELATION ANALYSIS

To continue our data exploration, we perform a correlation analysis.

We first need to normalize the data so that the features can be compared.

## Normalization

In [None]:

# Columns selection
numeric_columns = ASD_phenotypic.select_dtypes(include=['float64', 'int64'])
categorical_columns = ASD_phenotypic.select_dtypes(include=['object', 'category'])

# Initialization of RobustScaler
scaler = RobustScaler()

# Fit and transform the numerical data
scaled_numeric_data = scaler.fit_transform(numeric_columns)

# numerical features normalized
numeric_columns_normalized = pd.DataFrame(scaled_numeric_data, columns=numeric_columns.columns, index=ASD_phenotypic.index)

# New normalized DataFrame
ASD_phenotypic_normalized = pd.concat([numeric_columns_normalized, categorical_columns], axis=1)
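
With its default settings, `RobustScaler` centers each feature on its median and divides by the interquartile range (25th to 75th percentile). The sketch below reproduces that computation by hand with NumPy on made-up scores, and contrasts it with z-score scaling, which an extreme value influences more.

```python
import numpy as np

# Made-up scores with one extreme value
x = np.array([80.0, 95.0, 100.0, 105.0, 160.0])

# Robust scaling: (x - median) / IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# For comparison: z-score scaling, based on mean and standard deviation
zscore = (x - x.mean()) / x.std()
print(robust)
print(zscore)
```

Since the features show possible outliers, a scaler based on median and IQR is a reasonable choice here.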



In [None]:
ASD_phenotypic_normalized.describe()

In [None]:
ASD_phenotypic_normalized 

## Correlation between Numerical Features

In [None]:
numeric_normalized = ASD_phenotypic_normalized.select_dtypes(include=['float64', 'int64'])

# Correlation Matrix
correlation_matrix = numeric_normalized.corr()
print(correlation_matrix)

In [None]:
correlation_matrix = numeric_normalized.corr()
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix,
            annot=True, 
            linewidths=.5, 
            fmt= '.2f',
            ax=ax,
            vmin=-1, 
            vmax=1,
            cmap = "coolwarm")
plt.show()


## Correlation between Categorical Features

In [None]:
numeric_columns, categorical_columns, ASD_phenotypic_normalized = of.select_columns(ASD_phenotypic_normalized)
# Compute Cramer's V for every pair of features
cramer_v_scores = pd.DataFrame(index=categorical_columns, columns=categorical_columns)
for feature1 in categorical_columns:
    for feature2 in categorical_columns:
        cramer_v = of.cramers_v(ASD_phenotypic_normalized[feature1], ASD_phenotypic_normalized[feature2])
        cramer_v_scores.loc[feature1, feature2] = cramer_v

# Plot heatmap of Cramer's V scores
plt.figure(figsize=(10, 8))
sns.heatmap(cramer_v_scores.astype(float), 
            annot=True, 
            linewidths=.5, 
            fmt='.2f',
            cmap="coolwarm")
plt.title("Cramer's V scores")
plt.show()
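
`of.cramers_v` is a project helper whose code is not shown. As a minimal sketch, assuming the plain (not bias-corrected) definition $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, Cramér's V can be computed from a contingency table as follows. The function name `cramers_v_sketch` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def cramers_v_sketch(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")

a = pd.Series(["WASI", "WASI", "DAS", "DAS"])
b = pd.Series(["WASI", "WASI", "DAS", "DAS"])  # identical to a
c = pd.Series(["X", "Y", "X", "Y"])            # independent of a

v_same = cramers_v_sketch(a, b)
v_indep = cramers_v_sketch(a, c)
print(v_same, v_indep)
```

A value of 1 means that one feature fully determines the other, while 0 means no association.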


FIQ_TEST_TYPE, VIQ_TEST_TYPE and PIQ_TEST_TYPE appear to carry almost the same information, so it is reasonable to remove some of them from our analysis.
In particular, the Cramér's V scores between the test-type features are:
- PIQ-FIQ = 1
- VIQ-FIQ = 1
- PIQ-VIQ = 1

Thus, we proceed keeping only PIQ_TEST_TYPE (Performance Intelligence Quotient), also because it includes the Ravens test type, which is widely used in ASD investigation. The other test types considered relevant are shared among the three features.

In [None]:
ASD_phenotypic_normalized = ASD_phenotypic_normalized.drop(columns=['VIQ_TEST_TYPE', 'FIQ_TEST_TYPE'])
ASD_phenotypic = ASD_phenotypic.drop(columns=['VIQ_TEST_TYPE', 'FIQ_TEST_TYPE'])


### Visual "correlation" between Categorical and Numerical Features

In [None]:

numeric_columns, _, _ = of.select_columns(ASD_phenotypic)

# Define the two specific categorical features in the desired order
categorical_columns = ['SITE_ID', 'PIQ_TEST_TYPE']

# Number of plots per row
plots_per_row = 4
num_plots = len(numeric_columns) * len(categorical_columns)
num_rows = (num_plots + plots_per_row - 1) // plots_per_row  # Calculate the number of rows needed

fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(20, num_rows * 5))

# Flatten axes for easy iteration
axes = axes.flatten()

plot_idx = 0
for cat_col in categorical_columns:
    for numeric_col in numeric_columns:
        ax = axes[plot_idx]
        
        # Order the categories by median, computed once on the plotted data
        median = ASD_phenotypic.groupby(cat_col, observed=False)[numeric_col].median().sort_values()
        sns.boxplot(x=cat_col, y=numeric_col, data=ASD_phenotypic, order=median.index, ax=ax)
        
        # Highlight the median in red
        for i in range(len(median)):
            ax.plot(i, median.iloc[i], 'ro')
        
        ax.set_title(f'Boxplot of {numeric_col} by {cat_col}')
        ax.set_xlabel(cat_col)
        ax.set_ylabel(numeric_col)
        
        plot_idx += 1

# Remove any empty subplots
for i in range(plot_idx, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()


We notice that a site-specific or test-specific analysis would be possible.
However, this is currently out of scope; we leave these directions to further investigations.
Thus, we may decide to drop the feature SITE_ID.

#  DATA PREPROCESSING

First of all, to avoid mismatches between values that are the same but written in upper or lower case, we convert them all to upper case.

In [None]:
#We convert all the characters to upper case for all the categorical features
category_columns_upper = ASD_phenotypic.select_dtypes(include='category').apply(lambda x: x.str.upper())

#We now write them back to the dataset
ASD_phenotypic[category_columns_upper.columns] = category_columns_upper

In [None]:
numeric_columns, categorical_columns, ASD_phenotypic = of.select_columns(ASD_phenotypic)


In [None]:
# We obtain the names of the features 
categorical_column_names = categorical_columns.tolist()
categorical_column_names

## Exploiting categorical features

### SITE_ID

SITE_ID refers to the site where the data were collected.

In [None]:
# Access a specific categorical column using the list of names
specific_category_column = ASD_phenotypic[categorical_column_names[0]].value_counts(dropna=False)
specific_category_column


Some data were collected at the same center under different site labels, so we decide to merge those labels.

In [None]:
# We create a function to replace the categories for the indicated cases

def replace_categories(category):
    if "UCLA" in category:
        return "UCLA"
    if "LEUVEN" in category:
        return "LEUVEN"
    if "UM" in category:
        return "UM"
    else:
        return category

# Then we apply the replace function
ASD_phenotypic[categorical_column_names[0]] = ASD_phenotypic[categorical_column_names[0]].apply(replace_categories).astype('category')

# Now we check the new order
specific_category_column = ASD_phenotypic[categorical_column_names[0]].value_counts(dropna=False)
specific_category_column
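
A substring test such as `"UM" in category` would also match any other site name that happens to contain those letters. As an alternative sketch, assuming that the sub-sites of one center differ only by a trailing numeric suffix (for example `UCLA_1`, `UCLA_2`), the suffix can be stripped with a regular expression. The site list below is illustrative.

```python
import pandas as pd

# Illustrative site labels
sites = pd.Series(["UCLA_1", "UCLA_2", "LEUVEN_1", "UM_2", "MAX_MUN", "NYU"])

# Strip a trailing "_<digits>" suffix; other names pass through unchanged
merged = sites.str.replace(r"_\d+$", "", regex=True)
print(merged.tolist())
```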

### PIQ_TEST_TYPE 

PIQ_TEST_TYPE refers to the type of test that each center chose to use.


In [None]:
specific_category_column = ASD_phenotypic[categorical_column_names[1]].value_counts(dropna=False)
specific_category_column

In [None]:
# We create a function to replace the categories for the indicated cases

def replace_categories(category):
    if pd.isna(category):  # Check whether the value is NaN
        return category  # If it is NaN, return it unchanged
    if "WASI" in category:
        return "WASI"
    if "WISC" in category:
        return "WISC"
    if "WAIS" in category:
        return "WAIS"
    if "DAS" in category:
        return "DAS"
    if "HAWIK" in category:
        return "HAWIK"
    if "PPVT" in category:
        return "PPVT"
    if "RAVENS" in category:
        return "RAVENS"
   
    else:
        return category


ASD_phenotypic[categorical_column_names[1]] = ASD_phenotypic[categorical_column_names[1]].apply(replace_categories).astype('category')
specific_category_column = ASD_phenotypic[categorical_column_names[1]].value_counts(dropna=False)
print(specific_category_column)



## Managing Missing Values

Now we are going to fill in the missing values for all the features, based on an analysis of the information carried by each one.

### IQ Test Type

We use the features FIQ, VIQ and PIQ to fill some of the values in PIQ_TEST_TYPE.
Since the test-type features have more missing values than the scores, we compare each pair of features. For instance, if PIQ has a value and PIQ_TEST_TYPE is missing, we fill the latter with the mode; if both are missing, we assign the category NOT_AVAILABLE.

In [None]:
# Features pairs to be checked 
feature_pairs = [
    ('FIQ_TEST_TYPE', 'FIQ'),
    ('PIQ_TEST_TYPE', 'PIQ'),
    ('VIQ_TEST_TYPE', 'VIQ')]

# Iterate over each pair of features
for test_type_col, score_col in feature_pairs:
    # Test-type columns dropped earlier are looked up in the original dataset
    if test_type_col not in ASD_phenotypic.columns:
        test_type_df = ASD_phenotypic_original
    else:
        test_type_df = ASD_phenotypic
    # Iterate over each row of the DataFrame
    for index, row in test_type_df.iterrows():
        # Check whether the value of 'test_type_col' is missing
        if pd.isnull(row[test_type_col]):
            # If the value of 'score_col' is present
            if not pd.isnull(row[score_col]):
                # Compute the mode of 'test_type_col'
                mode_test_type = test_type_df[test_type_col].mode()[0]
                # Replace the missing 'test_type_col' value with the mode
                test_type_df.at[index, test_type_col] = mode_test_type
            # If both 'test_type_col' and 'score_col' are missing
            else:
                # Check whether "NOT_AVAILABLE" is already among the categories of the column
                if "NOT_AVAILABLE" not in test_type_df[test_type_col].cat.categories:
                    # Add "NOT_AVAILABLE" as a new category
                    test_type_df[test_type_col] = test_type_df[test_type_col].cat.add_categories("NOT_AVAILABLE")
                # Assign the category 'NOT_AVAILABLE' to 'test_type_col'
                test_type_df.at[index, test_type_col] = 'NOT_AVAILABLE'
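
The same two rules can be expressed with boolean masks instead of a row-by-row loop. The sketch below applies them to a toy frame with plain string columns (not categorical), so no category needs to be added first. The values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PIQ": [101.0, 96.0, np.nan, 110.0],
    "PIQ_TEST_TYPE": ["WASI", None, None, "WASI"],
})

# Masks are computed once, before any value is filled
mode_value = toy["PIQ_TEST_TYPE"].mode()[0]
type_missing = toy["PIQ_TEST_TYPE"].isna()
score_present = toy["PIQ"].notna()

# Score present, type missing: use the most frequent test type
toy.loc[type_missing & score_present, "PIQ_TEST_TYPE"] = mode_value
# Score and type both missing: use an explicit placeholder
toy.loc[type_missing & ~score_present, "PIQ_TEST_TYPE"] = "NOT_AVAILABLE"
print(toy["PIQ_TEST_TYPE"].tolist())
```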


### Test Scores

##### Data Standardization

Before filling the missing values, we note that the data for the variables FIQ, VIQ and PIQ were obtained with different tests, so the scores are on different scales. We therefore rescale them so that all the scores are on the same scale.
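
The rescaling used in the following cells is a linear mapping from the score range of a given test onto the common range 50-160. A minimal sketch of that mapping is shown here; the function name `rescale` is hypothetical.

```python
def rescale(score, old_min, old_max, new_min=50, new_max=160):
    """Linearly map `score` from [old_min, old_max] onto [new_min, new_max]."""
    return (score - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# DAS full-scale range 30-170, as stated in the comments of this notebook
low = rescale(30, 30, 170)
mid = rescale(100, 30, 170)
high = rescale(170, 30, 170)
print(low, mid, high)  # 50.0 105.0 160.0
```

The endpoints of the old range map to the endpoints of the new one, and the ordering of the scores is preserved.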


In [None]:
ASD_phenotypic

In [None]:
ASD_phenotypic_copy = ASD_phenotypic.copy()
ASD_phenotypic_copy['SUB_ID'] = ASD_phenotypic_original['SUB_ID']


In [None]:
#For FIQ, the score scale is 30-170 if the test taken was "DAS", otherwise it is 50-160.
#We will unify all the data to the more commonly used scale, i.e. 50-160

present_subjects = set(ASD_phenotypic_copy['SUB_ID'])
#Original dataset restricted to the subjects still present in ASD_phenotypic
filtered_original = ASD_phenotypic_original[ASD_phenotypic_original['SUB_ID'].isin(present_subjects)]


In [None]:
# We start by defining the condition
condition = (filtered_original['FIQ_TEST_TYPE'] == 'DAS') | (ASD_phenotypic['FIQ'] < 50) | (ASD_phenotypic['FIQ'] > 160)

# Then we rescale the values selected by the condition to the new scale
ASD_phenotypic['FIQ'] = np.where(condition, 
                        (ASD_phenotypic['FIQ'] - 30) / (170 - 30) * (160 - 50) + 50, 
                        ASD_phenotypic['FIQ'])

In [None]:
#For VIQ, the score scale is 31-169 if the test taken was "DAS",
#36-164 if the test taken was "STANFORD",
#40-160 if the test taken was "PPVT", otherwise it is 50-160.
#We will unify all the data to the more commonly used scale, i.e. 50-160

for i in ASD_phenotypic.index:
    test_type = filtered_original['VIQ_TEST_TYPE'][i]
    current_value = ASD_phenotypic['VIQ'][i]
    if (test_type == 'DAS') or (current_value <36) or (current_value > 164):
        ASD_phenotypic.loc[i, 'VIQ'] = (current_value - 31) / (169 - 31) * (160 - 50) + 50
    elif (test_type == 'STANFORD') or (current_value <40) or (current_value > 160):
        ASD_phenotypic.loc[i, 'VIQ'] = (current_value - 36) / (164 - 36) * (160 - 50) + 50
    elif (test_type == 'PPVT') or (current_value <50) or (current_value > 160):
        ASD_phenotypic.loc[i, 'VIQ'] = (current_value - 40) / (160 - 40) * (160 - 50) + 50


In [None]:
#For PIQ, the score scale is 31-166 if the test taken was "DAS",
#36-164 if the test taken was "STANFORD",
#50-160 if the test taken was "RAVENS", otherwise it is 53-160.
#We will unify all the data to the more commonly used scale, i.e. 50-160

for i in ASD_phenotypic.index:
    test_type = ASD_phenotypic['PIQ_TEST_TYPE'][i]
    current_value = ASD_phenotypic['PIQ'][i]
    # The rescaled value is written back to PIQ
    if (test_type == 'DAS') or (current_value <36) or (current_value > 164):
        ASD_phenotypic.loc[i, 'PIQ'] = (current_value - 31) / (166 - 31) * (160 - 50) + 50
    elif (test_type == 'STANFORD') or (current_value <50) or (current_value > 160):
        ASD_phenotypic.loc[i, 'PIQ'] = (current_value - 36) / (164 - 36) * (160 - 50) + 50


In [None]:
ASD_phenotypic.describe()

### ADOS_TOTAL

The feature "ADOS_TOTAL" is simply the sum of the scores obtained by "ADOS_COMM" and "ADOS_SOCIAL", so we can reduce the amount of missing values using the values of those features.

In [None]:
# Replace -9999 and "-9999" with NaN
ASD_phenotypic_original["ADOS_COMM"] = ASD_phenotypic_original["ADOS_COMM"].replace(['-9999', -9999], np.nan)
ASD_phenotypic_original["ADOS_SOCIAL"] = ASD_phenotypic_original["ADOS_SOCIAL"].replace(['-9999', -9999], np.nan)

for i in ASD_phenotypic["ADOS_TOTAL"].index:
    ados_comm = ASD_phenotypic_original["ADOS_COMM"][i]
    ados_social = ASD_phenotypic_original["ADOS_SOCIAL"][i]
    if not pd.isna(ados_comm) and not pd.isna(ados_social):
        ASD_phenotypic.loc[i, "ADOS_TOTAL"] = ados_comm + ados_social
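
A vectorized variant of this step is sketched below on made-up values. Unlike the loop above, which overwrites ADOS_TOTAL whenever both components are available, this version fills only the totals that are missing. Whether that difference matters depends on how consistent the recorded totals are.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ADOS_COMM":   [3.0, np.nan, 4.0],
    "ADOS_SOCIAL": [7.0, 6.0, 8.0],
    "ADOS_TOTAL":  [np.nan, np.nan, 12.0],
})

# The sum is NaN whenever either component is missing
component_sum = toy["ADOS_COMM"] + toy["ADOS_SOCIAL"]

# Fill only the missing totals; existing totals are left untouched
toy["ADOS_TOTAL"] = toy["ADOS_TOTAL"].fillna(component_sum)
print(toy["ADOS_TOTAL"].tolist())
```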
    

#### Test scores filling

To fill the missing values of the administered tests, we decided to rely on the standard score given by the mean
of the general population (when such statistics are available in the literature) or on the cutoff for the diagnosis of ASD;
otherwise we use the mean computed from our dataset.

So for the features "FIQ", "VIQ", "PIQ", "ADOS_TOTAL" and "ADI_R_VERBAL_TOTAL_BV" we apply a custom function that checks whether a literature value is available and otherwise assigns the mean of the feature. When a literature value is available, the function draws the fill value at random from a small range around it.


In [None]:
# List of features that we want to fill
test_score_features = ["FIQ", "VIQ", "PIQ", "ADOS_TOTAL", "ADI_R_VERBAL_TOTAL_BV"]

# Function to fill with a literature-based value or with the mean of the feature
def test_score_fill(feature_value, feature_name, feature_mean):
    # We create a dictionary to store the literature score ranges
    literature_scores = {
    "FIQ": list(range(95, 100)), # USA, mean score retrieved from https://www.worlddata.info/iq-by-country.php
    "VIQ": list(range(95, 100)), # USA, mean score retrieved from https://www.worlddata.info/iq-by-country.php
    "PIQ": list(range(95, 100)), # USA, mean score retrieved from https://www.worlddata.info/iq-by-country.php
    "ADOS_TOTAL": list(range(6, 12)), # autism cutoff retrieved from https://www.researchgate.net/figure/ADOS-maximum-score-and-cut-off-points-for-ASD-15_tbl1_361212648
    "ADI_R_VERBAL_TOTAL_BV": list(range(7, 10)), # autism cutoff retrieved from https://www.researchgate.net/figure/Summary-statistics-for-ADI-R-domain-scores_tbl4_6709395
    }

    # If the value is missing, we draw a replacement at random from the literature
    # range when one exists; otherwise we use the mean of the data
    if pd.isna(feature_value):

        if feature_name in literature_scores:
            return np.random.choice(literature_scores[feature_name])
        else:
            return feature_mean
    else:

        return feature_value

# Loop for filling the features
for feature_name in test_score_features:
    feature_mean = ASD_phenotypic_original[feature_name].mean()
    ASD_phenotypic[feature_name] = ASD_phenotypic[feature_name].apply(test_score_fill, args=(feature_name, feature_mean))


In [None]:
ASD_phenotypic.describe()

If we display the information of the dataset, we can see that there are no longer any missing values.

In [None]:

ASD_phenotypic.info()

In [None]:
ASD_phenotypic.isnull().sum()

# Data distribution

We want to check whether filling the missing values has affected the data distribution.

In [None]:
# Use the function to plot the distributions of the features
of.plot_distributions(ASD_phenotypic)

Finally, having worked on the SITE_ID feature to better prepare for possible future work, we decide to drop it.

In [None]:
ASD_phenotypic = ASD_phenotypic.drop(columns=["SITE_ID"])

# PRE PROCESSED DATA STORAGE

In [None]:

# We store the pre-processed dataset in a .csv file
ASD_phenotypic.to_csv('DataSets/Phenotypic Datasets/ASD_phenotypic_preprocessed.csv', index=False)
# And also the diagnostic groups
ASD_clinical.to_csv('DataSets/Phenotypic Datasets/ASD_clinical.csv', index=False)
