# Diabetes research

```
Data processing and model creation based on k-nearest neighbours (KNN).
The purpose of the study is to create a model that can make preliminary predictions about the presence and development of diabetes.
```

```
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. 
Several constraints were placed on the selection of these instances from a larger database. 
In particular, all patients here are females at least 21 years old of Pima Indian heritage.
```

In [3]:
import pandas as pd
import plotly

plotly.offline.init_notebook_mode(connected=True)

data = pd.read_csv('diabetes.csv')

print("Data Types:")
data.info()

print("\nFirst five observations:")
data.head()

Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

First five observations:


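|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |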


We now have the diabetes dataset for training prediction models.
First, let's look at the distribution of the target variable.
Knowing how the classes are balanced will help us avoid misreading the models' performance metrics later.

In [13]:
healthy = data[data['Outcome'] == 0]['Outcome']
diabetics = data[data['Outcome'] != 0]['Outcome']

def target_count():
    """Plot a horizontal bar chart of the Outcome class counts."""
    # Reindex to a fixed order (0 = healthy, 1 = diabetic): value_counts()
    # sorts by frequency, so the labels would otherwise match only by luck.
    counts = data['Outcome'].value_counts().reindex([0, 1], fill_value=0)
    counts = counts.values.tolist()

    trace = plotly.graph_objs.Bar(
        x=counts,
        y=['healthy', 'diabetic'],
        orientation='h',
        text=counts,
        textfont=dict(size=15),
        textposition='auto',
        opacity=1,
        marker=dict(
            color=['green', 'blue'],
            line=dict(color='#000000', width=1)
        )
    )

    layout = dict(title='Count of Outcome variable')

    fig = dict(data=[trace], layout=layout)
    plotly.offline.iplot(fig)

target_count()
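
The class counts matter for model evaluation because a model that always predicts the majority class already reaches an accuracy equal to that class's share of the data. The sketch below shows one way to compute this baseline. The helper name `class_balance` and the toy labels are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share and the majority-class baseline accuracy.

    `class_balance` is a hypothetical helper written for this illustration.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    shares = {label: n / total for label, n in counts.items()}
    # Always predicting the most frequent class is right this often.
    return shares, max(shares.values())

# Toy labels for illustration only (not taken from diabetes.csv).
toy_labels = [0, 0, 0, 1]
shares, baseline = class_balance(toy_labels)
print(shares)    # {0: 0.75, 1: 0.25}
print(baseline)  # 0.75
```

In the notebook, the same helper could be applied to the real target with `class_balance(data['Outcome'])`. Any model trained later should beat that baseline accuracy to be considered useful.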
