
In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S R
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

DATA_SOURCE_MAPPING = 'autism-files:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F4512103%2F7740048%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240301%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240301T233906Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D858d7aa76cb01c06399330f9181c4aa5f840592d982df4fe970dbe32745aab967ed1acb356a70e4ac3be595493c8c609f5a52e92e2fe3f66d81a4446c98405d9a6afda8b66dc5b2c8048c5595237964d2f049bf138e9049efcedb5ac8e5414cc4e2655b956d8828ee333114179145096256fa44d2051c837814488a684cc2c077b74d7c7663f126d2ac8ab8347717dd2c954517b985af84cb5339f935547c30a1a01ad5d27993dca35bc1a531d0b1f4918f60729192ee3164693e88b0d2051a5113ff1ebff3613437fe2e109f2b3a10791292a770d5ac18daa5befdb563497c01549a67793ec0e7c5e2335666835930c59d1af5df8d5e2a996add87404129519'

KAGGLE_INPUT_PATH = '/kaggle/input'
KAGGLE_WORKING_PATH = '/kaggle/working'

system(paste0('sudo umount ', '/kaggle/input'))
system(paste0('sudo rmdir ', '/kaggle/input'))
system(paste0('sudo mkdir -p -- ', KAGGLE_INPUT_PATH), intern=TRUE)
system(paste0('sudo chmod 777 ', KAGGLE_INPUT_PATH), intern=TRUE)
system(
  paste0('sudo ln -sfn ', KAGGLE_INPUT_PATH,' ',file.path('..', 'input')),
  intern=TRUE)

system(paste0('sudo mkdir -p -- ', KAGGLE_WORKING_PATH), intern=TRUE)
system(paste0('sudo chmod 777 ', KAGGLE_WORKING_PATH), intern=TRUE)
system(
  paste0('sudo ln -sfn ', KAGGLE_WORKING_PATH, ' ', file.path('..', 'working')),
  intern=TRUE)

data_source_mappings = strsplit(DATA_SOURCE_MAPPING, ',')[[1]]
for (data_source_mapping in data_source_mappings) {
    path_and_url = strsplit(data_source_mapping, ':')
    directory = path_and_url[[1]][1]
    download_url = URLdecode(path_and_url[[1]][2])
    filename = sub("\\?.+", "", download_url)
    destination_path = file.path(KAGGLE_INPUT_PATH, directory)
    print(paste0('Downloading and uncompressing: ', directory))
    if (endsWith(filename, '.zip')){
      temp = tempfile(fileext = '.zip')
      download.file(download_url, temp)
      unzip(temp, overwrite = TRUE, exdir = destination_path)
      unlink(temp)
    }
    else{
      temp = tempfile(fileext = '.tar')
      download.file(download_url, temp)
      untar(temp, exdir = destination_path)
      unlink(temp)
    }
    print(paste0('Downloaded and uncompressed: ', directory))
}

print(paste0('Data source import complete'))


# <font color='plum'> Our Journey's Itinerary </font>

## <font color='plum'> The Main Questions </font>

* Are there trends in autism prevalence within specific groups of children? Is overall prevalence increasing?
* What will the population of autistic adults look like in 5 years?
* What should we as a society prepare ourselves for?

These and *many* other questions are addressed here.  The journey begins in 2000 and takes us as close to a 5-year outlook as the data allow! We will be using a few data sources:
* Data collected by the CDC (specifically ADDM National Data)
* Lifespan data collected by the CDC (for both total Americans and the ASD community)

I will *also* be working with a few known assumptions:
* This is based on data collected over time.  Some groups of Americans have been either unrepresented or under-represented in that data.
* I am using age 22 as the transitional point when children become adults.
* There is no single way that children are diagnosed.  Different regions use different diagnostic methods.

Some of these things seem obvious - others may not.  

## <font color='plum'> Method of Analysis </font>

Barring any unforeseen anomalies, the plan is to go through the following steps:
* Gather CDC data from 2000 to at least 2020.
* Determine whether there are any trends across ethnic or gender groups.
* Draw conclusions and determine next steps.



## <font color='plum'> Set up the environment </font>

In [None]:
# Install the packages this notebook needs (only required once per environment)
install.packages("tidyverse")
install.packages("janitor")

# Load the packages. tidyverse already attaches readr, dplyr, and ggplot2,
# so the explicit calls below are redundant but harmless.
library(tidyverse)
library(readr)
library(dplyr)
library(janitor)
library(tidyselect)
library(ggplot2)

We now have the tools we need to begin.  The data themselves also deserve a closer look.  I will be using data from data.gov as the primary source for a national breakdown. A regional breakdown will be

The first dataset we load is data.gov's [Autism and Developmental Disabilities Monitoring Network](https://www.cdc.gov/ncbddd/autism/data/assets/exceldata/ADV_AllData.xlsx).

In [None]:
# The autism-specific ADDM data are stored as Excel workbooks (.xlsx), not csv files
# The readxl package's read_excel() function reads each workbook into a data frame
install.packages("readxl")
library(readxl)

autism_prev <- read_excel("/kaggle/input/autism-files/Prev.xlsx") # national data by biological sex
autism_prev_ethnic <- read_excel("/kaggle/input/autism-files/prev_ethnic.xlsx") # national data by ethnic group

# We now have specific data for national autism prevalence between 2000 and 2020
head(autism_prev)
head(autism_prev_ethnic)

Now that we have the basic data, we can clean it by making the column titles a bit more readable.
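The cleaning step is not shown in this notebook, so here is a minimal sketch of one way to do it with `clean_names()` from the janitor package loaded earlier. The data frame `raw` and its column headers are hypothetical stand-ins, not the real spreadsheet's headers, and the values are placeholders.

```r
library(janitor)

# Hypothetical example data: the headers and values are placeholders,
# not taken from Prev.xlsx
raw <- data.frame(
  "Year" = c(2000, 2002),
  "Prevalence" = c(1.0, 2.0),
  "Biological Sex" = c("Male", "Female"),
  check.names = FALSE
)

# clean_names() converts headers to lower-case snake_case
cleaned <- clean_names(raw)
names(cleaned)
```

With headers like these, the result matches the variable names used in the rest of this notebook (`year`, `prevalence`, `biological_sex`).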

We have the following definitions of our variables:

## <font color='plum'> National Data </font>

**autism_prev**

*note - this dataset is missing values as the data were not collected during certain years or for certain subgroups*
* “year” variable: the year the data are reporting on
* “prevalence”: the frequency of autism per 1,000 children.  
* "biological_sex": whether the child was born as male or female

**autism_prev_ethnic**

Biological sex and ethnic background are also accounted for in this dataset
* “year” variable: the year the data are reporting on
* “prevalence”: the frequency of autism per 1,000 children.  
* "ethnic_group": ethnicity of the child (listed at birth) <font color='plum'>**NOTE**</font> there is a possibility that this is a reason that we do not see "multiracial" children listed until much later in the timeline.
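Because prevalence is expressed per 1,000 children, it can help to convert between that scale and the "1 in N" phrasing used in headlines. The sketch below is plain arithmetic; the helper names are made up for illustration, and the figure of 1 in 36 is the one quoted in this notebook's final conclusion.

```r
# Hypothetical helper names, defined only for this illustration
per_1000_from_one_in <- function(n) 1000 / n
one_in_from_per_1000 <- function(p) 1000 / p

# "1 in 36" expressed as a rate per 1,000 children
rate <- per_1000_from_one_in(36)
round(rate, 1)

# Converting the rate back recovers N
one_in_from_per_1000(rate)
```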


## <font color='plum'> Graphical representation </font>
From 2000 to 2020 we have data to graphically represent the change in prevalence for children based on biological sex.

In [None]:
# To differentiate between biological sex prevalence in children (Historical)
ggplot(data = autism_prev, mapping = aes(x = year, y = prevalence, color = biological_sex)) +
  geom_point() +
  geom_smooth(method = "lm", na.rm = TRUE) +
  labs(title = "Autism Prevalence: 2000 through 2020",
       subtitle = "Difference between Biological Sexes",
       y = "Prevalence", x = "Year",
       caption = "Data collected by ADDM/CDC", color = "Biological Sex")

**We also have data for various ethnic groups. Due to the limited quantity of data, proper trend analysis would be inconclusive at best (wrong at worst).  I will show the image but PLEASE remember this huge caveat.**
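One way to make this caveat concrete is to count how many non-missing observations each group has and when its data begin. The sketch below uses a toy data frame with placeholder group names and values (not the real `autism_prev_ethnic` data); the same `group_by()` and `summarise()` pattern should apply to the real data frame, assuming it has the `year`, `prevalence`, and `ethnic_group` columns described above.

```r
library(dplyr)

# Toy data with placeholder groups and values, for illustration only
toy <- data.frame(
  year = rep(c(2000, 2002, 2004), times = 2),
  ethnic_group = rep(c("Group A", "Group B"), each = 3),
  prevalence = c(1, 2, 3, NA, NA, 4)
)

# Count non-missing observations and find the first year with data, per group
coverage <- toy %>%
  group_by(ethnic_group) %>%
  summarise(
    n_obs = sum(!is.na(prevalence)),
    first_year = min(year[!is.na(prevalence)])
  )
coverage
```

Groups with only a handful of observations, or with a late first year, are the ones where any fitted trend line deserves the most skepticism.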

In [None]:
# To differentiate between ethnic group prevalence in children (Historical)
# Because data are missing for under-represented groups, geom_smooth() is left on its default smoother
# (not a linear fit) so the plot shows when data became available for each group
ggplot(data=autism_prev_ethnic) +
  geom_point(mapping = aes(x=year,y=prevalence, color=ethnic_group)) +
  geom_smooth(mapping = aes(x=year,y=prevalence, color=ethnic_group), na.rm = TRUE) +
  labs(title = "Autism Prevalence: 2000 through 2020", subtitle = "Difference between Ethnic Groups", y="Prevalence", x="Year", caption = "Data collected by ADDM/CDC", color="Ethnic Groups")

## <font color='plum'> Decision point </font>

To move forward with national trend analysis, we are going to focus on trends based on biological sex rather than ethnic background.  There are too many gaps in the data that could lead to incorrect conclusions when handling data based on ethnic background.  

<font color='plum'>**Conclusion 1**:
Not enough data have been collected across all Americans, and the data we do have could be biased. The next step here would be to reach out to advocacy groups to collect more data. </font>

## <font color='plum'> Trend analysis based on biological sex </font>

In order to move forward, I will use the historical data to determine how best to carry out a trend analysis.
Based on the previous data, there is a trend for both biological sexes.

The next step is to separate the data into male and female subsets.

In [None]:
# Creating male and female specific data.frames
male_prev <- filter(autism_prev,biological_sex == "Male")
female_prev <- filter(autism_prev,biological_sex == "Female")

head(male_prev)
head(female_prev)

Due to data limitations, I am obligated to use the Random Walk Forecast function (`rwf()` from the forecast package) to estimate the next 5 male and female values.

**<font color="red">THIS IS LESS THAN IDEAL</font>**
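For intuition, a random walk with drift simply extends the last observed value by the average historical step. The sketch below reproduces that point-forecast arithmetic by hand in base R on placeholder numbers (not the real prevalence series). It assumes equally spaced observations with no missing values, and it does not compute the prediction intervals that a forecasting function would report.

```r
# Placeholder series for illustration only, not the ADDM data
y <- c(2, 4, 5, 8, 10)
n <- length(y)

# Drift is the average change per step, which equals (last - first) / (n - 1)
drift <- (y[n] - y[1]) / (n - 1)

# Point forecasts for the next 5 steps: last value plus h times the drift
h <- 1:5
point_forecast <- y[n] + h * drift
point_forecast
```

Because every forecast depends only on the first and last observations, a single unusual endpoint shifts the whole projection, which is part of why this approach is less than ideal.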

In [None]:
#install the forecast package
install.packages("forecast")
library(forecast)

# For Male estimates for the next 5 years:
rwf(male_prev$prevalence,h=5,drift=TRUE,level = c(80,95),fan = FALSE,lambda = NULL,biasadj = FALSE)

# For Female estimates for the next 5 years:
rwf(female_prev$prevalence,h=5,drift=TRUE,level = c(80,95),fan = FALSE,lambda = NULL,biasadj = FALSE)

In [None]:
#current and future data
autism_future <- read_excel("/kaggle/input/autism-files/Prev_future.xlsx")
head(autism_future)

#current and future chart
ggplot(data=autism_future) +
  geom_point(mapping = aes(x=year,y=prevalence, color=biological_sex)) +
  geom_smooth(mapping = aes(x=year,y=prevalence, color=biological_sex), method=lm) +
  labs(title = "Autism Prevalence: 2000 through 2025", y="Prevalence", x="Year", caption = "Data collected by ADDM/CDC", color="Biological Sex")

## <font color='plum'> Final Conclusion </font>
As the data show, there is a lot left to be desired.  Data collection by the CDC is spotty (and covers only 18 states of the US).  As of this initial stab at the questions, 1 in 36 8-year-old children are diagnosed as autistic.  This data collection project in no way addresses "why are there so many new diagnoses being made?", but in preparation for the future, this makes a statement.

Many states consider a person an adult at 22 years or older.  At the rate that autism diagnoses are growing, many of the children diagnosed in 2002 will be aging into adulthood.  While they will no longer be in a public school, there will be additional need for state/federal assistance.
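The timing can be sketched with simple arithmetic. The example assumes that each cohort is 8 years old in its surveillance year (the age quoted above) and uses 22 as the adult threshold from this notebook's assumptions; the function name is made up for illustration.

```r
surveillance_age <- 8   # assumed age of children in each surveillance year
adult_age <- 22         # transitional age used in this notebook

# Hypothetical helper: the year a surveillance cohort reaches the adult threshold
year_reaching_adulthood <- function(surveillance_year) {
  surveillance_year + (adult_age - surveillance_age)
}

year_reaching_adulthood(c(2002, 2020))
```

Under these assumptions each cohort reaches adulthood 14 years after its surveillance year.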


### <font color='plum'> Next Steps </font>
* Reach out to advocacy groups or locate additional data by ethnic group to continue the analysis (either nationally or regionally).
* Work with other advocates to gather and clean additional data, to get a better sense of the upcoming number of autistic adults.

Sources:
[The Autism and Developmental Disabilities Monitoring (ADDM) Network (CDC.gov)](https://www.cdc.gov/ncbddd/autism/data/index.html#data)