# Exploratory Data Analysis on Pima Indians Diabetes Dataset

This notebook explores the Pima Indians Diabetes dataset.

## Overview

The Pima Indians Diabetes dataset is used to predict the onset of diabetes mellitus from a set of diagnostic measurements.

- Number of instances (rows): 768
- Number of attributes (columns): 9 (8 features + 1 target label)

In [11]:
import numpy as np
import pandas as pd

# The CSV file has no header row, so the column names are supplied explicitly.
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

df = pd.read_csv('pima-indians-diabetes.data.csv', names=columns)

# Print all columns on a single line instead of wrapping or truncating them.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("Shape of dataset: ", df.shape)
print("This is the first 5 rows of the dataset: ")
print(df.head())

Shape of dataset:  (768, 9)
This is the first 5 rows of the dataset: 
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


### Meaning of columns

- **Pregnancies**: Number of times pregnant  

- **Glucose**: Plasma glucose concentration after 2 hours in an oral glucose tolerance test  

- **BloodPressure**: Diastolic blood pressure (mm Hg)  

- **SkinThickness**: Triceps skin fold thickness (mm)  

- **Insulin**: 2-hour serum insulin (μU/mL)  

- **BMI**: Body mass index (weight in kg/(height in m)^2)  

- **DiabetesPedigreeFunction**: Score representing likelihood of diabetes based on family history  

- **Age**: Age of the patient (years)  

- **Outcome**: Class variable (0 = no diabetes, 1 = diabetes)  


## Identification of variables and data types

The dataset contains 768 rows and 9 columns.  
The `df.info()` output shows that all columns are numeric (7 are `int64`, 2 are `float64`) and that every column has 768 non-null entries, so no value is stored as missing (`NaN`).  

- **Predictor variables (8 numeric features):** Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.  
- **Target variable (1 binary categorical):** Outcome (0 = no diabetes, 1 = diabetes).  

Thus, all eight predictors are numerical, and the target is a binary class label stored as an integer.
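
The split into predictors and target described above can be sketched in a few lines. To stay self-contained, the example rebuilds only the five rows printed by `df.head()` earlier. In the notebook the same steps would be applied to `df` directly. The names `sample`, `X`, and `y` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# The five rows shown by df.head() above, used here as a stand-in for df.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Split into the 8 predictor columns and the binary target column.
X = sample.drop(columns="Outcome")
y = sample["Outcome"]

# Every column should be numeric, and the target should only take the values 0 and 1.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)
target_values = sorted(y.unique().tolist())

print("Predictors:", list(X.columns))
print("All columns numeric:", all_numeric)
print("Target values:", target_values)
```

Keeping `X` and `y` separate from the start makes it harder to include the label among the features by accident in later steps.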

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
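
Note that `df.info()` only counts values stored as `NaN`. The first rows printed earlier already contain zeros in `SkinThickness` and `Insulin`, and a zero is not a plausible reading for a measurement such as skin fold thickness or BMI. The sketch below counts zeros per column under the assumption, which the notebook itself does not state, that a zero in these five columns stands for an unrecorded measurement. It again uses only the five rows shown by `df.head()` so that it runs on its own. The names `sample` and `zero_as_missing` are hypothetical.

```python
import pandas as pd

columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# The five rows shown by df.head() above, used here as a stand-in for df.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Assumption: a zero in these columns is a placeholder, not a real measurement.
# Pregnancies and Outcome are left out because zero is a valid value there.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count how many rows hold a zero in each of the selected columns.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

Running the same count on the full `df` would show how many of the 768 rows are affected in each column, which is worth knowing before computing summary statistics.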
