# **Logistic Regression - Autism Spectrum Disorder Children Traits**

## This project uses the Kaggle dataset 'ASD Children Traits', which looks at a number of factors to determine whether a child has ASD traits or not.

## The data is used below to train a Logistic Regression model that predicts new cases from the same factors.

In [94]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [95]:
df = pd.read_csv('project_2_data.csv')

## Getting a general picture of the data

In [97]:
df.head()

Unnamed: 0,CASE_NO_PATIENT'S,A1,A2,A3,A4,A5,A6,A7,A8,A9,...,Global developmental delay/intellectual disability,Social/Behavioural Issues,Childhood Autism Rating Scale,Anxiety_disorder,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who_completed_the_test,ASD_traits
0,1,0,0,0,0,0,0,1,1,0,...,Yes,Yes,1,Yes,F,middle eastern,Yes,No,Family Member,No
1,2,1,1,0,0,0,1,1,0,0,...,Yes,Yes,2,Yes,M,White European,Yes,No,Family Member,Yes
2,3,1,0,0,0,0,0,1,1,0,...,Yes,Yes,4,Yes,M,Middle Eastern,Yes,No,Family Member,Yes
3,4,1,1,1,1,1,1,1,1,1,...,Yes,Yes,2,Yes,M,Hispanic,No,No,Family Member,Yes
4,5,1,1,0,1,1,1,1,1,1,...,Yes,Yes,1,Yes,F,White European,No,No,Family Member,Yes


In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1985 entries, 0 to 1984
Data columns (total 28 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   CASE_NO_PATIENT'S                                   1985 non-null   int64  
 1   A1                                                  1985 non-null   int64  
 2   A2                                                  1985 non-null   int64  
 3   A3                                                  1985 non-null   int64  
 4   A4                                                  1985 non-null   int64  
 5   A5                                                  1985 non-null   int64  
 6   A6                                                  1985 non-null   int64  
 7   A7                                                  1985 non-null   int64  
 8   A8                                                  1985 non-null   int64  
 9

## Checking for null values in the dataframe and removing them

In [100]:
df.isnull().sum()

CASE_NO_PATIENT'S                                      0
A1                                                     0
A2                                                     0
A3                                                     0
A4                                                     0
A5                                                     0
A6                                                     0
A7                                                     0
A8                                                     0
A9                                                     0
A10_Autism_Spectrum_Quotient                           0
Social_Responsiveness_Scale                            9
Age_Years                                              0
Qchat_10_Score                                        39
Speech Delay/Language Disorder                         0
Learning disorder                                      0
Genetic_Disorders                                      0
Depression                     

In [101]:
df.dropna(inplace=True)

In [102]:
len(df)

1923

In [103]:
df.Age_Years.min()

1

In [104]:
df.Age_Years.max()

18

## Removing unnecessary columns

In [106]:
df.drop(["Who_completed_the_test", "CASE_NO_PATIENT'S"], axis=1, inplace=True)

## Transforming string data into dummy code

In [108]:
cols_to_encode = ['Speech Delay/Language Disorder', 'Learning disorder', 
                  'Genetic_Disorders', 'Depression', 'Global developmental delay/intellectual disability', 
                  'Social/Behavioural Issues', 'Anxiety_disorder', 'Jaundice', 'Family_mem_with_ASD', 'ASD_traits']
df[cols_to_encode] = df[cols_to_encode].replace({'Yes': 1, 'No': 0})



In [109]:
df.Sex = [1 if value == 'F' else 0 for value in df.Sex]
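The two encoding steps above can be sketched on a small toy frame. The frame and its values are hypothetical, and `Series.map` is used here as an alternative to `replace`; it is not what the notebook itself runs.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    'Jaundice': ['Yes', 'No', 'Yes'],
    'ASD_traits': ['No', 'Yes', 'Yes'],
    'Sex': ['F', 'M', 'M'],
})

yes_no_cols = ['Jaundice', 'ASD_traits']

# Series.map returns NaN for any value missing from the mapping,
# so unexpected labels surface instead of passing through unchanged.
for col in yes_no_cols:
    toy[col] = toy[col].map({'Yes': 1, 'No': 0})

# Same rule as the list comprehension: 'F' becomes 1, anything else 0.
toy['Sex'] = (toy['Sex'] == 'F').astype(int)

print(toy)
```

Because `map` leaves unmapped values as NaN, a follow-up `toy.isnull().sum()` would reveal any label other than 'Yes' or 'No'.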

## Encoding labels

In [111]:
le = LabelEncoder()
df['Ethnicity'] = le.fit_transform(df['Ethnicity'])
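The `df.head()` output shows both 'middle eastern' and 'Middle Eastern', which `LabelEncoder` treats as two separate classes. Integer codes also suggest an ordering that a linear model will take literally. A possible alternative, sketched on hypothetical toy values, is to harmonise the spelling first and then one-hot encode:

```python
import pandas as pd

# Toy values copied from the spellings visible in df.head(); not the full column.
toy = pd.DataFrame({'Ethnicity': ['middle eastern', 'White European',
                                  'Middle Eastern', 'Hispanic']})

# Harmonise whitespace and case so spelling variants share one category.
toy['Ethnicity'] = toy['Ethnicity'].str.strip().str.title()

# One-hot encode: one 0/1 column per category, with no implied ordering.
dummies = pd.get_dummies(toy['Ethnicity'], prefix='Ethnicity', dtype=int)

print(dummies)
```

Whether one-hot encoding improves this particular model is not tested here; the sketch only shows the mechanics.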

## Statistics

In [113]:
df.describe()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10_Autism_Spectrum_Quotient,...,Depression,Global developmental delay/intellectual disability,Social/Behavioural Issues,Childhood Autism Rating Scale,Anxiety_disorder,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,ASD_traits
count,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,...,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0,1923.0
mean,0.302132,0.24181,0.215289,0.274571,0.279771,0.306292,0.347374,0.24649,0.26105,0.453458,...,0.536141,0.536141,0.536141,1.712949,0.534581,0.278211,7.220489,0.770671,0.314093,0.529381
std,0.459302,0.428291,0.411129,0.446414,0.449003,0.461073,0.47626,0.431079,0.439322,0.497959,...,0.498822,0.498822,0.498822,1.022013,0.498932,0.448235,4.922844,0.420511,0.464274,0.499266
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,0.0,10.0,1.0,0.0,1.0
75%,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,2.0,1.0,1.0,10.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,4.0,1.0,1.0,15.0,1.0,1.0,1.0


## Standardising the numeric columns

In [127]:
scaled_df = df.copy()

In [129]:
scaler = StandardScaler()

In [131]:
scaled_df[['Childhood Autism Rating Scale', 'Qchat_10_Score', 'Age_Years', 'Social_Responsiveness_Scale']] = scaler.fit_transform(
    df[['Childhood Autism Rating Scale', 'Qchat_10_Score', 'Age_Years', 'Social_Responsiveness_Scale']])

## Splitting the dataset into training and testing data

In [133]:
y = scaled_df['ASD_traits'] 
X = scaled_df.drop(['ASD_traits'], axis=1)

In [135]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
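In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations. A common alternative is to split first and fit the scaler on the training rows only. The sketch below assumes a hypothetical toy frame with random values, not the real data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two numeric columns and one binary column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age_Years': rng.integers(1, 19, size=100).astype(float),
    'Qchat_10_Score': rng.integers(0, 11, size=100).astype(float),
    'A1': rng.integers(0, 2, size=100),
    'ASD_traits': rng.integers(0, 2, size=100),
})
num_cols = ['Age_Years', 'Qchat_10_Score']

X = toy.drop(columns='ASD_traits')
y = toy['ASD_traits']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Work on explicit copies so the assignments below do not touch `toy`.
X_train = X_train.copy()
X_test = X_test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
# ... and reuse those statistics for the test rows.
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train[num_cols].mean().round(6))
```

With this ordering the training columns have mean 0 and unit variance, while the test columns are only approximately standardised, which is expected.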

## Checking for Multicollinearity

In [137]:
condition_num = np.linalg.cond(X)
print('Condition number: ', condition_num)

Condition number:  1.2071900008257816e+16
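A condition number on the order of 1e16 means the feature matrix is numerically rank-deficient: at least one column is (almost) a linear combination of others. In the `df.describe()` output, 'Depression', 'Global developmental delay/intellectual disability' and 'Social/Behavioural Issues' share the same mean and standard deviation, which may point to duplicated columns; that is an assumption, not something the notebook verifies. The sketch below uses hypothetical toy data to show how an exact duplicate inflates the condition number and how it could be detected:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three independent columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Depression': rng.integers(0, 2, size=200),
    'Jaundice': rng.integers(0, 2, size=200),
    'Age_Years': rng.integers(1, 19, size=200),
}).astype(float)

# Add an exact copy of an existing column (assumed scenario).
toy['Social/Behavioural Issues'] = toy['Depression']

cond_with = np.linalg.cond(toy.to_numpy())
print('Condition number with duplicate:   ', cond_with)

# Flag columns that repeat an earlier column value for value.
dup_mask = toy.T.duplicated().to_numpy()
duplicate_cols = toy.columns[dup_mask].tolist()
print('Exact duplicates:', duplicate_cols)

reduced = toy.drop(columns=duplicate_cols)
cond_without = np.linalg.cond(reduced.to_numpy())
print('Condition number without duplicate:', cond_without)
```

Exact-duplicate detection only catches identical columns; pairwise correlations (`toy.corr()`) would be a natural next check for columns that are merely highly correlated.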


## Creating and fitting the model

In [139]:
lr = LogisticRegression(max_iter=500)

In [141]:
lr.fit(X_train, y_train)
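For a binary target, the fitted model stores one weight per feature plus an intercept, and turns the linear score into a probability with the sigmoid function. The sketch below reproduces this by hand on synthetic data from `make_classification` (an assumption for illustration; the notebook's own data is not used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X_toy, y_toy)

# Linear score z = X @ w + b, then the sigmoid maps it to P(y = 1).
z = X_toy @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = model.predict_proba(X_toy)[:, 1]

# predict() assigns class 1 where the score is positive (probability above 0.5).
labels_manual = (z > 0).astype(int)

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(labels_manual, model.predict(X_toy)))
```

The same relationship holds for `lr`: `lr.predict_proba(X_test)[:, 1]` gives the probabilities behind each 0/1 prediction.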

## Predicting

In [143]:
y_pred = lr.predict(X_test)
y_pred

array([1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,

## Evaluating the model

In [145]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.96


In [147]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95       266
           1       0.96      0.95      0.96       311

    accuracy                           0.96       577
   macro avg       0.96      0.96      0.96       577
weighted avg       0.96      0.96      0.96       577
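The precision and recall figures in the report are derived from the confusion matrix. A minimal sketch with hypothetical labels (not the notebook's `y_test` and `y_pred`) shows how the numbers for class 1 come about:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: share of predicted positives that are truly positive.
precision = tp / (tp + fp)
# Recall: share of true positives that the model found.
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

Running `confusion_matrix(y_test, y_pred)` in the notebook would give the corresponding counts for the 577 test cases.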



# **Summary**

### A Logistic Regression model was built from the test results of 1923 participants aged between 1 and 18 years (1346 used for training, 577 held out for testing) to predict whether a person has **Autism Spectrum Disorder traits** ('ASD_traits') or not.

### The model was trained on the following predictors:
<h3 style="color:purple">'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10_Autism_Spectrum_Quotient', 'Social_Responsiveness_Scale', 'Age_Years', 'Qchat_10_Score', 'Speech Delay/Language Disorder', 'Learning disorder', 'Genetic_Disorders', 'Depression', 'Global developmental delay/intellectual disability', 'Social/Behavioural Issues', 'Childhood Autism Rating Scale', 'Anxiety_disorder', 'Sex', 'Ethnicity', 'Jaundice', 'Family_mem_with_ASD'</h3>

### *(where A1-A10 indicate Autism Spectrum Quotient)*

### The model reaches an **accuracy of 96%** on the 577 held-out test cases, with precision and recall between 0.95 and 0.96 for both classes, which suggests it classifies this dataset reliably.