# Akimel O’odham and Diabetes

The Akimel O’odham, or Gila River People, who were noted for their technological prowess and agricultural skills, live in what is now the United States. After the Americas were colonized, and the United States subsequently gained its independence, their diet changed significantly from one that was mostly farmed. This caused high levels of diabetes in a localised area, which had not been found before this change in diet.

In 1965, the Epidemiology and Field Studies Branch of the National Institute of Arthritis, Diabetes and Digestive and Kidney Diseases (catchy name, I know), in partnership with the Indian Health Service, sent a team to begin an observational study of the Pima community at Gila River. The research lasted 40 years and has been one of the most important pieces of research into diabetes.

In this study we will use their data to train a neural network that, given 8 inputs, can tell us with roughly 80% accuracy whether a person has diabetes.

## Firstly we need to import our libraries

In [321]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

## Stochastic Models

Stochastic models use randomness (or are probability based, if you prefer) and often require multiple runs to produce results. We will be using a stochastic model here, so to make life a bit easier we're going to fix the seed at 7, so that the results are repeatable from run to run.

In [322]:
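seed = 7
np.random.seed(seed)      # seeds NumPy's random number generator
tf.random.set_seed(seed)  # seeds TensorFlow too, which draws the initial weights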
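
As a quick check of what fixing the seed buys us, here is a minimal sketch using only NumPy (the names `first` and `second` are ours, purely for illustration). Reseeding with the same value should replay the same sequence of random numbers.

```python
import numpy as np

# Seed, draw three numbers, then reseed with the same value and draw again.
np.random.seed(7)
first = np.random.rand(3)

np.random.seed(7)
second = np.random.rand(3)

# Both draws match because the generator restarted from the same state.
print(np.array_equal(first, second))  # True
```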

## Let's have a sneaky peek at that data

This will all make a bit more sense if we can get a look at the features that we're working with here.

In [None]:
df = pd.read_csv("diabetes.csv")
df

OK, so far so good. Let's pull this out into the result column (y) and the rest of the data (x). We're going to convert both to NumPy arrays; NumPy is a vector-based array calculation library that TensorFlow understands.

These look like this when they are at home:

In [None]:
y = df.Outcome.to_numpy()
x = df.drop(columns=["Outcome"]).to_numpy()
x

In [None]:
y
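
If the split is hard to picture, here is a small sketch on a made-up two-row table. The values and the `toy_` names are invented for illustration and are not taken from `diabetes.csv`; the two calls are the same ones used above.

```python
import pandas as pd

# A tiny made-up table with two feature columns and an Outcome column.
toy_df = pd.DataFrame(
    {"Glucose": [100, 150], "BMI": [25.0, 33.6], "Outcome": [0, 1]}
)

toy_y = toy_df.Outcome.to_numpy()                    # 1-D array of labels
toy_x = toy_df.drop(columns=["Outcome"]).to_numpy()  # 2-D array of features

print(toy_x.shape, toy_y.shape)  # (2, 2) (2,)
```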

## Are we making our network too specific?

Since neural networks are trying to find a line of "best fit", one which predicts the trend within a dataset, we might brute force ourselves to a wrong answer, one that is so specific to the dataset that it doesn't work with anything else. This is called overfitting. To avoid it we're going to reserve about ⅓ of the data for testing, so we can check that the neural net works on data it hasn't seen before.

In [326]:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.33, random_state=seed)
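
To see what the split does without loading the real data, here is a sketch on nine made-up samples (the arrays and the short `tr_`/`te_` names are invented for this example). It uses the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Nine made-up samples with two features each, and alternating labels.
toy_x = np.arange(18).reshape(9, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.33, random_state=7
)

# Every sample ends up in exactly one of the two sets.
print(len(tr_x), len(te_x))
```

The rows are shuffled before splitting, so the test set is a random sample rather than just the last few rows.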

## Building up the network

There are a few things going on here, but what we are going to do is create a 4 layer neural net: one layer of 12 nodes, two layers of 8 and one layer of 1. These are all "Dense" layers (the most common kind), in which every node is connected to every node in the layers above and below.
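
Because every node in a Dense layer connects to every node in the layer before it, the number of values the network has to learn is easy to count. A small sketch, assuming the 8 input columns of our data and one bias per node (`layer_sizes` is just a name for this example):

```python
# Layer widths: 8 input features, then the four Dense layers.
layer_sizes = [8, 12, 8, 8, 1]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out  # one weight per connection, one bias per node
    print(f"{n_in} -> {n_out}: {params} parameters")
    total += params

print(f"total: {total}")  # total: 293
```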

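Each of these has an activation function. This acts like a threshold: once the perceptron (a single neuron) has combined its inputs with its weights, the activation decides how strongly it fires. The weights need an initial value, which is set by the initializer. `relu` is the most common activation, and `random_uniform` is a very common initializer. The final layer uses `sigmoid` instead, which squashes its output into the range 0 to 1 so we can read it as a probability.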
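
To make the two activations used in this network concrete, here is a plain NumPy sketch of `relu` and `sigmoid`. These are hand-written for illustration and are not the Keras implementations.

```python
import numpy as np

def relu(z):
    # Pass positive values through unchanged, clamp negatives to zero.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(sigmoid(z))
```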


In [327]:
model = tf.keras.Sequential()
model.add(
    tf.keras.layers.Dense(
        12,
        kernel_initializer=tf.keras.initializers.random_uniform,
        activation=tf.keras.activations.relu
    )
)
model.add(
    tf.keras.layers.Dense(
        8,
        kernel_initializer=tf.keras.initializers.random_uniform,
        activation=tf.keras.activations.relu
    )
)
model.add(
    tf.keras.layers.Dense(
        8,
        kernel_initializer=tf.keras.initializers.random_uniform,
        activation=tf.keras.activations.relu
    )
)
model.add(
    tf.keras.layers.Dense(
        1,
        kernel_initializer=tf.keras.initializers.random_uniform,
        activation=tf.keras.activations.sigmoid
    )
)

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[tf.keras.metrics.binary_accuracy]
)

## Fitting

This is the process of sending our training values into the neural network and adjusting the weights of the neurons. More and more values are fed in until the weights produce outputs somewhere close to our training set's target values, in our case `y`, which was the column called `"Outcome"` in the CSV. `epochs` is the number of full passes to make over the training data, and `batch_size` is the number of rows processed before each adjustment. The bigger the batch size, the more memory you need, but the faster each epoch will go. Batch sizes are conventionally powers of 2, although any size that fits into memory will work (we use 10 here).

This adjustment is done with both a `loss`, which is a measurement of error, and an `optimizer`, which uses that error to decide how to adjust the perceptrons for the next iteration.
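
The loss we compiled the model with is binary cross-entropy. Here is a rough NumPy sketch of the idea, not the Keras implementation; the function name, the `eps` clipping value and the sample numbers are all made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip so we never take log(0).
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0])
confident = np.array([0.9, 0.1, 0.8])  # mostly right, and sure about it
unsure = np.array([0.5, 0.5, 0.5])     # sitting on the fence every time

print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, unsure))
```

Predictions that are confidently right get a lower loss than predictions that hedge, which is the signal the optimizer pushes the network towards.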

In [None]:
model.fit(train_x, train_y, epochs=500, batch_size=10)
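
To get a feel for what `epochs=500` and `batch_size=10` add up to, here is a back-of-the-envelope sketch. The helper name is ours and `n_train = 514` is an assumed row count for illustration only; use `len(train_x)` for the real figure.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # The last batch may be smaller than the rest, so round up.
    return math.ceil(n_samples / batch_size)

n_train = 514  # assumed number of training rows, for illustration only
per_epoch = updates_per_epoch(n_train, batch_size=10)

print(per_epoch)        # 52
print(per_epoch * 500)  # 26000
```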

## Now we see how well it worked

Well, that's all well and good, but like, what does that give us? Let's feed our test data in and see.

In [None]:
scores = model.evaluate(test_x, test_y)
scores

Hmm, that's not super helpful. Let's add some names.

In [None]:
# The loss is not a percentage, so print it as-is; accuracy is a fraction of 1.
print(f"{model.metrics_names[0]}: {scores[0]:.4f}")
print(f"{model.metrics_names[1]}: {scores[1] * 100:.2f}%")
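
The accuracy figure is simply the share of people the network called correctly, once each probability is rounded to a yes or a no at 0.5. A minimal NumPy sketch with made-up numbers (none of these values come from the model):

```python
import numpy as np

# Made-up predicted probabilities and true labels.
probs = np.array([0.92, 0.15, 0.40, 0.73])
labels = np.array([1, 0, 1, 1])

predicted = (probs > 0.5).astype(int)    # 1 if the model leans towards "yes"
accuracy = np.mean(predicted == labels)  # fraction of correct calls
print(accuracy)  # 0.75
```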

In [None]:
inputs = pd.DataFrame.from_dict(
    {
        "Pregnancies": [6],
        "Glucose": [148],
        "BloodPressure": [72],
        "SkinThickness": [35],
        "Insulin": [169.5],
        "BMI": [33.6],
        "DiabetesPedigreeFunction": [0.627],
        "Age": [50]
    }
)
inputs

In [None]:
predictions = model.predict(
    inputs.to_numpy()
)
print(f"{predictions[0][0]} probability that this person has diabetes")

## Following on

1. What other activation functions are available?
2. What other initialisation functions are available?
3. What other layers are available?
4. How do you feel about helping develop models to detect diabetes?