# Predict whether or not a patient has diabetes, based on certain diagnostic measurements
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

- The dataset consists of several medical predictor variables and one target variable, Outcome.
- Predictor variables include:
    - the number of pregnancies the patient has had, 
    - their BMI, 
    - insulin level, 
    - age, and so on.

## Import the necessary packages

In [None]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Load and preprocess the data

In [None]:
data = pd.read_csv('prima-indians-diabetes.csv', header=None)

In [None]:
data.head()
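
Because the file is read with `header=None`, the columns are labelled 0 to 8 rather than by name. The cell below is a minimal sketch of attaching readable names. It assumes the file follows the usual column order of the Pima Indians Diabetes dataset; the names are not stored in the CSV, and the two rows in `demo` are made-up placeholders standing in for the loaded data.

```python
import pandas as pd

# Assumed column order of the Pima Indians Diabetes dataset (not stored in the CSV).
COLUMN_NAMES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Placeholder rows standing in for the result of pd.read_csv(..., header=None).
# The values are illustrative only and are not taken from the dataset.
demo = pd.DataFrame([
    [1, 100, 70, 20, 80, 25.0, 0.30, 30, 0],
    [3, 150, 80, 30, 120, 32.5, 0.50, 45, 1],
])

demo.columns = COLUMN_NAMES
features = demo.drop(columns="Outcome")  # the 8 predictor variables
target = demo["Outcome"]                 # the target variable
```

With named columns, selections such as `features["BMI"]` read more clearly than positional slices like `iloc[:, 5]`.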

In [None]:
X = data.iloc[:,:8]
Y = data.iloc[:,8:]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

## Train a classifier/Build diabetes prediction model

In [None]:
# Create a Gaussian Naive Bayes (NB) classifier
clf = GaussianNB()

In [None]:
clf.fit(x_train, y_train.values.ravel())

## Make predictions using test samples

In [None]:
# Make predictions on test data
predictions = clf.predict(x_test)

In [None]:
predictions

## Evaluate the model

In [None]:
# Evaluate the accuracy (scikit-learn metrics take y_true first, then y_pred)
print('Accuracy Score: ', accuracy_score(y_test, predictions))

## EXERCISE: Perform k-fold cross-validation and use other model evaluation metrics to evaluate the model.
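
One possible starting point for the exercise is sketched below; it is not the only way to approach it. It uses stratified 5-fold cross-validation and reports precision, recall, F1, and a confusion matrix alongside accuracy. To keep the cell self-contained, it runs on synthetic data from `make_classification`; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `Y.values.ravel()`, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with 8 features; replace with X and Y.values.ravel().
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=5)

# Stratified folds keep the class balance roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
model = GaussianNB()

# Accuracy on each of the 5 held-out folds.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", scores.mean())

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
cv_predictions = cross_val_predict(model, X_demo, y_demo, cv=cv)

cm = confusion_matrix(y_demo, cv_predictions)
print("Confusion matrix:\n", cm)
print("Precision:", precision_score(y_demo, cv_predictions))
print("Recall:", recall_score(y_demo, cv_predictions))
print("F1 score:", f1_score(y_demo, cv_predictions))
```

Recall is often worth watching closely in a diagnostic setting, since a false negative means a patient with diabetes is missed.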