# Understand Your Data With Descriptive Statistics

In this short tutorial, we will focus on:

1. Take a peek at your data.
2. Review the dimensions of your dataset.
3. Review the data types of attributes in your data.
4. Summarize the distribution of instances across classes in your dataset.
5. Summarize your data using descriptive statistics.
6. Understand the relationships in your data using correlations.
7. Review the skew of the distributions of each attribute.

To follow the tutorial, download the data from Kaggle: [Pima Indians Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

## 1. Take a peek at your data

There is no substitute for looking at the raw data. It can reveal insights that you cannot get any other way, and it can plant seeds that later grow into ideas on how to better pre-process and handle the data for machine learning tasks. You can review the first 20 rows of your data by calling the head() method on the Pandas DataFrame with an argument of 20. Only the first 10 of those rows are reproduced below.

In [1]:
# Start by importing pandas
import pandas as pd

In [3]:
# Load your data
data = pd.read_csv("data/diabetes.csv")

In [4]:
# View first 20 rows
data.head(20)

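|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |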


You can see that the first column lists the row number. This is the DataFrame's index, which starts at 0 and is not one of the attributes in the file. It is handy for referencing a specific observation.
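If you want to experiment without downloading the file, the sketch below rebuilds a tiny sample from the first five rows shown above and peeks at both ends of it. The names `csv_text`, `sample`, `first_rows`, and `last_rows` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# First five rows copied from the output above; the real file has 768 rows.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

# Read the text as if it were a CSV file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns 5.
first_rows = sample.head(3)

# tail(n) is the counterpart for the end of the DataFrame.
last_rows = sample.tail(2)

print(first_rows)
print(last_rows)
```

Looking at the tail as well as the head is a cheap way to catch problems, such as a trailing summary row, that only appear at the end of a file.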

## 2. Review the dimensions of your dataset

You must have a very good handle on how much data you have, both in terms of rows and columns.

- Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
- Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.

You can review the shape and size of your dataset by inspecting the shape attribute of the Pandas DataFrame.

In [12]:
shape = data.shape

In [6]:
shape

(768, 9)

In [7]:
rows, columns = data.shape

In [8]:
rows

768

In [9]:
columns

9

The results are listed as rows, then columns. You can see that the dataset has 768 rows and 9 columns.
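As a small aside, `shape` is not the only way to ask about size. The sketch below uses a hypothetical three-row sample (values taken from the first rows of the dataset; the variable names are illustrative) to show how `shape`, `len()`, and `size` relate to each other.

```python
import pandas as pd

# Hypothetical three-row, three-column sample, not the full dataset.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_columns = sample.shape

# len() gives the number of rows only.
n_rows_from_len = len(sample)

# size is the total number of cells, that is rows * columns.
n_cells = sample.size

print(n_rows, n_columns, n_rows_from_len, n_cells)
```

On the full dataset the same attributes would report 768 rows and 9 columns, as shown above.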

## 3. Review the data types of attributes in your data

The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.

In [11]:
data.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

You can see that most of the attributes are integers and that BMI and DiabetesPedigreeFunction are floating point types.
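Every attribute in this dataset is already numeric, so no conversion is needed here. To illustrate the string-to-integer conversion mentioned at the start of this section, the sketch below assumes a hypothetical variant of the data in which Outcome is stored as text. The labels "positive" and "negative" and the name `label_to_int` are assumptions made for the example, not part of the real file.

```python
import pandas as pd

# Hypothetical sample: "Outcome" is stored as text here, unlike in the real file.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": ["positive", "negative", "positive", "negative"],
})

# An explicit mapping keeps control over which label becomes which integer.
label_to_int = {"negative": 0, "positive": 1}
sample["Outcome"] = sample["Outcome"].map(label_to_int)

# select_dtypes filters columns by type, which helps on wide datasets.
numeric_columns = list(sample.select_dtypes(include="number").columns)
float_columns = list(sample.select_dtypes(include="float64").columns)

print(sample.dtypes)
print(numeric_columns)
print(float_columns)
```

An explicit dictionary is used rather than an automatic encoder so that the meaning of each integer stays visible in the code.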