Finding Data
In my initial search for a dataset of interest, I continually ran into aggregated data, so I changed my search criteria to 'ML datasets' and discovered several repositories of data suitable for this project.  The dataset I've chosen deals with adult Autistic Spectrum Disorder screening.  There are 704 samples of adults aged 17-61.  The features include responses to the AQ-10 test along with family history and demographic information.  The data was already split into feature and target sets, as seen in the code below.

In [None]:
!pip install ucimlrepo

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
autism_screening_adult = fetch_ucirepo(id=426) 
  
# data (as pandas dataframes) 
X = autism_screening_adult.data.features 
y = autism_screening_adult.data.targets 
  
# metadata 
print(autism_screening_adult.metadata) 
  
# variable information 
print(autism_screening_adult.variables) 


Cleaning the Data
From the initial variables output I could see that there were missing values in the age, ethnicity and relation features.  To get a better idea of what the dataset looked like, I printed out the unique responses for each category as seen below.

In [None]:
for c in X.columns[:]:
    print(c, X[c].unique())

As seen above:
-AQ-10 responses: Binary, appear clean.
-age: Integer, obvious outlier/miskey (383) and NaN.
-gender: Binary, m/f that can be changed to 1/0.
-ethnicity: Categorical, NaN, and two spellings of the same response ('others' and 'Others') that can be combined.
-jaundice: Binary, yes/no that can be changed to 1/0.
-family_pdd: Binary, yes/no that can be changed to 1/0.
-country_of_res: Categorical, appears clean.
-used_app_before: Binary, yes/no that can be changed to 1/0. (Curious how this was considered a predictor of autism.)
-result: Integer, appears clean.
-age_desc: Only one response that encompasses all but one of the ages.
-relation: Categorical, NaN. (Curious how this was considered a predictor of autism.)
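
The same kind of audit can also be done numerically rather than by reading the unique values. The following is a minimal sketch on a toy DataFrame; the values are made up for illustration, and the 0-120 plausibility bound for age is an assumption, not something taken from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "age": [25.0, 383.0, np.nan, 40.0],
    "ethnicity": ["White-European", None, "others", "Others"],
    "relation": ["Self", None, "Parent", "Self"],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Flag ages outside a plausible human range (the 0-120 bound is an assumption).
implausible_age = toy[(toy["age"] < 0) | (toy["age"] > 120)]
print(implausible_age)
```

Running the same two checks on X would give a per-column count of NaN values and isolate rows such as the one with an age of 383.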

To resolve the issues stated above, I ran the following code.  A printout showing the changes is included.

In [None]:
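# Work on a copy so the cleaning steps do not modify a view of the fetched data
X = X.copy()
# Map binary text responses to 1/0
X['jaundice'] = X['jaundice'].replace({'yes': 1, 'no': 0})
X['family_pdd'] = X['family_pdd'].replace({'yes': 1, 'no': 0})
X['gender'] = X['gender'].replace({'m': 1, 'f': 0})
# Treat the age of 383 as a miskeyed 38 (an assumption about the intended value)
X['age'] = X['age'].replace(383, 38)
# Merge the two spellings of 'others' and use 'Others' for missing ethnicity
X['ethnicity'] = X['ethnicity'].replace('others', 'Others').fillna('Others')
# Drop the features that are uninformative or implausible as predictors
X = X.drop(['age_desc', 'relation', 'used_app_before'], axis=1)
# Fill missing ages with the mean age, truncated to an integer
mean_value = int(X['age'].mean())
X['age'] = X['age'].fillna(mean_value)
for c in X.columns:
    print(c, X[c].unique())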

One-Hot Encoding Categorical Data
Both ethnicity and country_of_res are categorical features, so they are one-hot encoded before being passed to the models.

In [None]:
ohe = OneHotEncoder()
# One-hot encode ethnicity and swap the original column for the indicator columns
encoded_data = ohe.fit_transform(X[['ethnicity']])
X_encoded = pd.DataFrame(encoded_data.toarray(), columns=ohe.get_feature_names_out(['ethnicity']), index=X.index)
X = pd.concat([X.drop('ethnicity', axis=1), X_encoded], axis=1)
# Repeat for country_of_res
encoded_data = ohe.fit_transform(X[['country_of_res']])
X_encoded = pd.DataFrame(encoded_data.toarray(), columns=ohe.get_feature_names_out(['country_of_res']), index=X.index)
X = pd.concat([X.drop('country_of_res', axis=1), X_encoded], axis=1)

Creating a Training and Testing Dataset
The data provided was already split into features (X) and target (y).  The next cell splits the data further into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
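
If the target classes are imbalanced, a stratified split keeps the class proportions the same in the training and testing sets. The sketch below is an optional variation on the split above, shown on synthetic data; the 30/70 class balance is made up for illustration and is not the balance of the screening dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (30 positives, 70 negatives).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```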

Choosing a Model
Because my features are mostly binary or categorical, with only two numeric variables (age, result), I will look at models that handle these types of features well: Logistic Regression, Random Forest, and SVM.

To have a baseline model, I will first create, train and evaluate a Random Forest model with 100 trees.

In [None]:
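# ravel() flattens the single-column target frame into the 1-D array scikit-learn expects
rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train.values.ravel())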
yhat = rfc.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy:', accuracy)
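
One way to put the three candidate models on equal footing is to score each with the same cross-validation folds. The following is a rough sketch on synthetic data from `make_classification`, not on the screening dataset; the scaling step, `max_iter=1000`, and the 5-fold setup are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem standing in for the real features.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)

# Logistic Regression and SVM are sensitive to feature scale, so they get a scaler.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Wrapping the scaler and the model in a pipeline means the scaler is refit on each training fold, which avoids leaking information from the held-out fold into preprocessing.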
