# Diabetes Prediction

**The objective is to predict, from diagnostic measurements, whether a patient has diabetes. <br>
Several constraints were placed on the selection of these instances from a larger database: all patients here are females at least 21 years old of Pima Indian heritage.**

#### Feature description:
   - **Pregnancies**: Number of times pregnant
   - **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   - **BloodPressure**: Diastolic blood pressure (mm Hg)
   - **SkinThickness**: Triceps skin fold thickness (mm)
   - **Insulin**: 2-Hour serum insulin (mu U/ml)
   - **BMI**: Body mass index (weight in kg/(height in m)^2)
   - **DiabetesPedigreeFunction**: Diabetes pedigree function
   - **Age**: Age (years)
   - **Outcome**: Class variable (0 = no diabetes, 1 = diabetes)
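
As a quick illustration of the BMI definition in the list above, here is a minimal sketch of the formula. The function name `body_mass_index` and the example weight and height are hypothetical and are not taken from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight in kg / (height in m)^2."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical example values, not a row from diabetes.csv
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 1))  # 22.9
```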

Work plan:
1. [Study of general information.](#id1)
2. [Data preprocessing.](#id2)
3. [Exploratory data analysis.](#id3)
4. [Model building.](#id4)
5. [General conclusion.](#id5)

In [1]:
import pandas as pd
import numpy as np

from collections import Counter

import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve, auc, roc_auc_score
from sklearn.metrics import f1_score, precision_score, recall_score

import warnings
warnings.filterwarnings('ignore')

<a id="id1"></a>
## 1. Study of general information

In [2]:
data = pd.read_csv('diabetes.csv')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


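|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |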


In [3]:
data.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
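
The summary statistics report a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, and the 25% quantile of SkinThickness and Insulin is also 0. A zero is not a plausible measurement for these features, so these entries may be missing values recorded as 0. The sketch below shows one way to count such zeros and, optionally, mark them as `NaN`. Treating zeros as missing is an assumption of this sketch, and the names `ZERO_AS_MISSING`, `zero_report` and `zeros_to_nan` are hypothetical helpers. The small `sample` frame reuses the five rows printed by `data.head()`.

```python
import numpy as np
import pandas as pd

# Assumption: a value of 0 in these columns means "not measured"
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


def zero_report(df, columns=ZERO_AS_MISSING):
    """Count zeros per column and their share of all rows."""
    zeros = (df[columns] == 0).sum()
    return pd.DataFrame({'zeros': zeros, 'share': zeros / len(df)})


def zeros_to_nan(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in the given columns replaced by NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


# The five rows shown by data.head(), restricted to the affected columns
sample = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
})

report = zero_report(sample)
cleaned = zeros_to_nan(sample)
print(report)
```

On the full dataset the same helpers could be called as `zero_report(data)` and `zeros_to_nan(data)`. Whether to impute, drop, or keep the affected rows is a separate preprocessing decision.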


In [4]:
data.duplicated().sum()

0

In [12]:
unique_number = data.nunique() \
                .reset_index() \
                .rename(columns={'index': 'Feature', 0: 'Count'})
fig = px.bar(unique_number, x='Count', y='Feature', 
             text='Count', template='plotly_dark',
             title='<b>Number of unique values</b>', height=800)
fig.update_traces(textposition='outside')
fig.update_layout(font_size=20,
                  font_family="sans-serif")
fig.show()

In [40]:
# rename_axis + reset_index(name=...) yields the same column names on pandas 1.x and 2.x
count_target = data['Outcome'] \
    .value_counts() \
    .rename_axis('Diabetes') \
    .reset_index(name='Count')
count_target['Diabetes'] = count_target['Diabetes'].map({0: 'No', 1: 'Yes'})

fig = px.pie(count_target, values='Count', names='Diabetes')
fig.update_layout(title='Distribution of the target variable',
                  font_family="sans-serif",
                  font_size=24,
                  template='ggplot2', 
                  paper_bgcolor='lightgray')

fig.show()
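
The pie chart can be complemented with explicit numbers. The `describe()` output gives 768 rows and a mean `Outcome` of 0.348958, which corresponds to 268 positive and 500 negative cases. The sketch below computes the class shares and the accuracy of a trivial classifier that always predicts the majority class, which is a useful floor when judging models on imbalanced data. The helper names `class_balance` and `majority_baseline_accuracy` are hypothetical, and `outcome` is a synthetic series with the same class counts, not the real column.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, ordered by class label."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'share': counts / counts.sum()})


def majority_baseline_accuracy(target: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(target.value_counts(normalize=True).max())


# Synthetic stand-in with the class counts implied by describe(): 500 zeros, 268 ones
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

balance = class_balance(outcome)
baseline = majority_baseline_accuracy(outcome)
print(balance)
print(f'{baseline:.3f}')  # 0.651
```

With the loaded data, the same calls would be `class_balance(data['Outcome'])` and `majority_baseline_accuracy(data['Outcome'])`.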

<a id="id2"></a>
## 2. Data preprocessing