# Descriptive Statistics

## Pima Indians Dataset
This dataset describes the medical records for Pima Indians
and whether or not each patient will have an onset of diabetes within five years.

We are going to use the pandas library to load the data, which is stored as a CSV file.

In [1]:
import pandas as pd

We are going to use the describe() method of the pandas DataFrame.

In [2]:
# Load the CSV and summarise every attribute
filename = "../../datasets/pima_indians_diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "DataFrame"
df = pd.read_csv(filename, names=names)
# Display three decimal places (full option name)
pd.set_option('display.precision', 3)
df.describe()

| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845 | 120.895 | 69.105 | 20.536 | 79.799 | 31.993 | 0.472 | 33.241 | 0.349 |
| std | 3.37 | 31.973 | 19.356 | 15.952 | 115.244 | 7.884 | 0.331 | 11.76 | 0.477 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.244 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.372 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.626 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### describe() returns 8 statistical properties for each attribute
* Count: how many non-missing values the attribute has
* Mean: the arithmetic mean of all values of the attribute
* Standard Deviation: how much the values vary around the mean. A low $\sigma$ means the values lie close to the mean
* Minimum value
* 25<sup>th</sup> Percentile: the value at or below which 25% of the instances fall
* 50<sup>th</sup> Percentile: the value at or below which 50% of the instances fall (also called the median, the value that splits the distribution in half)
* 75<sup>th</sup> Percentile: the value at or below which 75% of the instances fall
* Maximum value
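
As a rough cross-check of the list above, each of the eight values can also be computed on its own. The sketch below uses a small made-up Series (the numbers are illustrative and not taken from the Pima data) and compares the hand-computed figures with what describe() reports.

```python
import pandas as pd

# Illustrative values only; not taken from the Pima dataset
s = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0])

# Compute each statistic separately, using the same labels as describe()
manual = pd.Series({
    "count": s.count(),        # number of non-missing values
    "mean": s.mean(),
    "std": s.std(),            # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),   # linear interpolation between data points
    "50%": s.median(),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

summary = s.describe()

# Side-by-side view of the two sets of numbers
print(pd.DataFrame({"manual": manual, "describe": summary}))
```

Note that the standard deviation here is the sample version (dividing by n - 1), which is the default for `Series.std()`.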

### Class Distribution
Summarise the distribution of instances across classes.

On classification problems you need to know how balanced the class values are. Highly imbalanced
problems (a lot more observations for one class than another) are common and may need special
handling in the data preparation stage of your project.

In [3]:
df.groupby('class').size()

class
0    500
1    268
dtype: int64
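
Raw counts are easier to judge as proportions. The sketch below starts from the two counts printed above and derives the share of each class and a simple majority-to-minority ratio; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the groupby output above
counts = pd.Series({0: 500, 1: 268}, name="count")

# Share of each class in the whole dataset
proportions = counts / counts.sum()

# How many majority-class instances there are per minority-class instance
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

On the DataFrame itself, `df['class'].value_counts(normalize=True)` should give the same proportions directly. The share of class 1 also agrees with the mean of the `class` column reported by describe() (0.349), since the column only holds 0 and 1.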

### Correlation between attributes
Correlation describes how two variables change together.
The Pearson coefficient ranges from -1 to 1: a value of -1 indicates a perfect negative linear relationship, 1 a perfect positive one, and 0 no linear relationship at all.

![correlation](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0)


In [4]:
df.corr(method='pearson')

| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| preg | 1.0 | 0.129 | 0.141 | -0.082 | -0.074 | 0.018 | -0.034 | 0.544 | 0.222 |
| plas | 0.129 | 1.0 | 0.153 | 0.057 | 0.331 | 0.221 | 0.137 | 0.264 | 0.467 |
| pres | 0.141 | 0.153 | 1.0 | 0.207 | 0.089 | 0.282 | 0.041 | 0.24 | 0.065 |
| skin | -0.082 | 0.057 | 0.207 | 1.0 | 0.437 | 0.393 | 0.184 | -0.114 | 0.075 |
| test | -0.074 | 0.331 | 0.089 | 0.437 | 1.0 | 0.198 | 0.185 | -0.042 | 0.131 |
| mass | 0.018 | 0.221 | 0.282 | 0.393 | 0.198 | 1.0 | 0.141 | 0.036 | 0.293 |
| pedi | -0.034 | 0.137 | 0.041 | 0.184 | 0.185 | 0.141 | 1.0 | 0.034 | 0.174 |
| age | 0.544 | 0.264 | 0.24 | -0.114 | -0.042 | 0.036 | 0.034 | 1.0 | 0.238 |
| class | 0.222 | 0.467 | 0.065 | 0.075 | 0.131 | 0.293 | 0.174 | 0.238 | 1.0 |
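
To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand (covariance divided by the product of the standard deviations) and compares it with `corr()`. The toy columns `x`, `y`, and `z` are made up for illustration and have nothing to do with the Pima attributes.

```python
import numpy as np
import pandas as pd

# Made-up columns: y rises with x, z falls exactly linearly with x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [10.0, 8.0, 6.0, 4.0, 2.0],
})

def pearson(a, b):
    """Sum of products of deviations, scaled by the spread of each variable."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())

r_xy = pearson(toy["x"], toy["y"])  # close to +1: strong positive relationship
r_xz = pearson(toy["x"], toy["z"])  # -1: perfect negative relationship

print(r_xy, r_xz)
print(toy.corr(method="pearson"))
```

The hand-computed values should match the corresponding cells of the matrix returned by `corr()`, which is symmetric and has 1.0 on the diagonal because every column correlates perfectly with itself.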


### Skew of Univariate Distributions

In [5]:
df.skew()

preg     0.902
plas     0.174
pres    -1.844
skin     0.109
test     2.272
mass    -0.429
pedi     1.920
age      1.130
class    0.635
dtype: float64
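
The sign of each value above indicates the direction of the longer tail: positive for a tail stretching to the right, negative for one stretching to the left, and roughly zero for a symmetric distribution. A small sketch on made-up data (not the Pima attributes) illustrates this.

```python
import pandas as pd

# Made-up samples chosen to show each case
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])  # one large value pulls the tail right
left_tailed = -right_tailed                          # mirror image: tail on the left
symmetric = pd.Series([1, 2, 3, 4, 5])               # evenly spread around the mean

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()
skew_sym = symmetric.skew()

print(skew_right, skew_left, skew_sym)
```

Mirroring the data flips the sign of the skew but keeps its magnitude, which is why `pres` and `mass` above (negative values) lean the opposite way from attributes such as `test` and `pedi`.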

## Wine Quality Dataset
This dataset contains instances for red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
between 0 (very bad) and 10 (very excellent).

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables 
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

We are going to use the describe() method of the pandas DataFrame.

In [6]:
df = pd.read_csv('../../datasets/winequality-red.csv', sep=';')
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.32 | 0.528 | 0.271 | 2.539 | 0.087 | 15.875 | 46.468 | 0.997 | 3.311 | 0.658 | 10.423 | 5.636 |
| std | 1.741 | 0.179 | 0.195 | 1.41 | 0.047 | 10.46 | 32.895 | 0.002 | 0.154 | 0.17 | 1.066 | 0.808 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.996 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.997 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.998 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.004 | 4.01 | 2.0 | 14.9 | 8.0 |


### describe() returns 8 statistical properties for each attribute
* Count: how many non-missing values the attribute has
* Mean: the arithmetic mean of all values of the attribute
* Standard Deviation: how much the values vary around the mean. A low $\sigma$ means the values lie close to the mean
* Minimum value
* 25<sup>th</sup> Percentile: the value at or below which 25% of the instances fall
* 50<sup>th</sup> Percentile: the value at or below which 50% of the instances fall (also called the median, the value that splits the distribution in half)
* 75<sup>th</sup> Percentile: the value at or below which 75% of the instances fall
* Maximum value

### Class Distribution
Summarise the distribution of instances across classes.

On classification problems you need to know how balanced the class values are. Highly imbalanced
problems (a lot more observations for one class than another) are common and may need special
handling in the data preparation stage of your project.

In [7]:
df.groupby('quality').size()

quality
3     10
4     53
5    681
6    638
7    199
8     18
dtype: int64

### Correlation between attributes
Correlation describes how two variables change together.
The Pearson coefficient ranges from -1 to 1: a value of -1 indicates a perfect negative linear relationship, 1 a perfect positive one, and 0 no linear relationship at all.

![correlation](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0)

In [8]:
df.corr(method='pearson')

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | -0.256 | 0.672 | 0.115 | 0.094 | -0.154 | -0.113 | 0.668 | -0.683 | 0.183 | -0.062 | 0.124 |
| volatile acidity | -0.256 | 1.0 | -0.552 | 0.002 | 0.061 | -0.011 | 0.076 | 0.022 | 0.235 | -0.261 | -0.202 | -0.391 |
| citric acid | 0.672 | -0.552 | 1.0 | 0.144 | 0.204 | -0.061 | 0.036 | 0.365 | -0.542 | 0.313 | 0.11 | 0.226 |
| residual sugar | 0.115 | 0.002 | 0.144 | 1.0 | 0.056 | 0.187 | 0.203 | 0.355 | -0.086 | 0.006 | 0.042 | 0.014 |
| chlorides | 0.094 | 0.061 | 0.204 | 0.056 | 1.0 | 0.006 | 0.047 | 0.201 | -0.265 | 0.371 | -0.221 | -0.129 |
| free sulfur dioxide | -0.154 | -0.011 | -0.061 | 0.187 | 0.006 | 1.0 | 0.668 | -0.022 | 0.07 | 0.052 | -0.069 | -0.051 |
| total sulfur dioxide | -0.113 | 0.076 | 0.036 | 0.203 | 0.047 | 0.668 | 1.0 | 0.071 | -0.066 | 0.043 | -0.206 | -0.185 |
| density | 0.668 | 0.022 | 0.365 | 0.355 | 0.201 | -0.022 | 0.071 | 1.0 | -0.342 | 0.149 | -0.496 | -0.175 |
| pH | -0.683 | 0.235 | -0.542 | -0.086 | -0.265 | 0.07 | -0.066 | -0.342 | 1.0 | -0.197 | 0.206 | -0.058 |
| sulphates | 0.183 | -0.261 | 0.313 | 0.006 | 0.371 | 0.052 | 0.043 | 0.149 | -0.197 | 1.0 | 0.094 | 0.251 |
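
With this many attributes, the full matrix is hard to scan. One possible way to condense it is to take a single column of the matrix and rank the other attributes by the absolute value of their correlation with it. The sketch below does this on synthetic data; the column names `feat_a`, `feat_b`, `noise`, and `target` are hypothetical and stand in for the wine attributes and `quality`.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical names): target depends strongly on feat_a,
# more weakly and negatively on feat_b, and not at all on noise
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": b,
    "noise": rng.normal(size=n),
    "target": 2.0 * a - 1.0 * b + rng.normal(scale=0.5, size=n),
})

corr = toy.corr(method="pearson")

# Keep the target column, drop its self-correlation, rank by strength
ranking = corr["target"].drop("target").abs().sort_values(ascending=False)

print(ranking)
```

Taking the absolute value treats strong negative and strong positive correlations alike; the sign can still be read from `corr["target"]` itself. For the wine data, the same idea would apply with `quality` in place of `target`.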


### Skew of Univariate Distributions

In [9]:
df.skew()

fixed acidity           0.983
volatile acidity        0.672
citric acid             0.318
residual sugar          4.541
chlorides               5.680
free sulfur dioxide     1.251
total sulfur dioxide    1.516
density                 0.071
pH                      0.194
sulphates               2.429
alcohol                 0.861
quality                 0.218
dtype: float64