# Descriptive Statistics

## Pima Indians Dataset
This dataset describes the medical records of Pima Indian patients
and whether or not each patient had an onset of diabetes within five years.

We are going to use the pandas library to load the data, which is stored in CSV format.

In [None]:
import pandas as pd

We are going to use the describe() function on the pandas DataFrame to summarise each attribute.

In [None]:
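# Load the data and summarise each attribute
filename = "../data/pima-indians-diabetes.data.csv"
# Column names are supplied explicitly via names=
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)
# Show three decimal places; the full option name avoids an ambiguous-key error in newer pandas
pd.set_option('display.precision', 3)
df.describe()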

### describe() returns 8 statistical properties for each attribute
* Count: how many non-missing values the attribute has
* Mean: the arithmetic mean of all values of the attribute
* Standard Deviation: how much the values vary around the mean. A low $\sigma$ means the values cluster close to the mean
* Minimum value
* 25<sup>th</sup> Percentile: the value below which 25% of the instances can be found
* 50<sup>th</sup> Percentile: the value below which 50% of the instances can be found (also called the median, the value that splits the distribution in half)
* 75<sup>th</sup> Percentile: the value below which 75% of the instances can be found
* Maximum value

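To make the eight properties concrete, the sketch below recomputes each of them by hand on a small series and compares the result with describe(). The values are hypothetical toy data, not taken from the Pima dataset. Note that pandas reports the sample standard deviation (ddof=1) here.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Pima dataset
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

summary = values.describe()

# Recompute each property individually for comparison
manual = pd.Series({
    "count": float(values.count()),   # number of non-missing values
    "mean": values.mean(),
    "std": values.std(),              # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),           # the median is the 50th percentile
    "75%": values.quantile(0.75),
    "max": values.max(),
})

# Side-by-side view of describe() and the manual calculation
print(pd.DataFrame({"describe": summary, "manual": manual}))
```
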
### Class Distribution
Summarise the distribution of instances across classes.

In classification problems you need to know how balanced the class values are. Highly imbalanced
problems (many more observations for one class than for another) are common and may need special
handling in the data preparation stage of your project.

In [None]:
df.groupby('class').size()

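Raw counts are easier to judge next to relative frequencies. As a minimal sketch, assuming a hypothetical toy frame with a `class` column (the labels below are made up), the class shares and a simple majority-to-minority ratio can be computed like this:

```python
import pandas as pd

# Hypothetical labels (toy values, not the real data)
toy = pd.DataFrame({"class": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

counts = toy.groupby("class").size()                  # absolute count per class
shares = toy["class"].value_counts(normalize=True)    # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()         # majority count / minority count

print(counts)
print(shares.sort_index())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests balanced classes; what counts as "highly imbalanced" depends on the problem, so no fixed threshold is implied here.
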
### Correlation between attributes
Correlation refers to the relationship between two variables and how they may or may not change together.
A correlation of -1 or 1 indicates a perfect negative or positive linear correlation respectively, whereas a value of 0 indicates no linear correlation at all.

![correlation](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0)


In [None]:
df.corr(method='pearson')

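Pearson's coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below implements that definition directly and compares it with the pandas result. The helper name `pearson` and the toy columns are hypothetical and chosen only to show the three reference values -1, 0 and 1.

```python
import math
import pandas as pd

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/n normalisation factors cancel, so plain sums are enough
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns
toy = pd.DataFrame({
    "x":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "up":   [2.0, 4.0, 6.0, 8.0, 10.0],   # rises exactly in step with x
    "down": [10.0, 8.0, 6.0, 4.0, 2.0],   # falls exactly as x rises
    "flat": [2.0, 1.0, 3.0, 1.0, 2.0],    # no linear trend with x
})

for col in ["up", "down", "flat"]:
    by_hand = pearson(toy["x"].tolist(), toy[col].tolist())
    by_pandas = toy["x"].corr(toy[col], method="pearson")
    print(f"x vs {col}: by hand {by_hand:.3f}, pandas {by_pandas:.3f}")
```
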
### Skew of Univariate Distributions

In [None]:
df.skew()

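Skew measures how asymmetric a distribution is: a value near 0 suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail. The sketch below shows all three cases on hypothetical toy columns (the names and values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy columns, one per case
toy = pd.DataFrame({
    "symmetric":  [1.0, 2.0, 3.0, 4.0, 5.0],        # mirror image around 3
    "right_tail": [1.0, 1.0, 2.0, 2.0, 10.0],       # one large value stretches the right tail
    "left_tail":  [-10.0, -2.0, -2.0, -1.0, -1.0],  # mirror image of right_tail
})

skew = toy.skew()   # one skewness value per column
print(skew)
```
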
## Wine Quality Dataset
This dataset contains red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent).

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

We again use the describe() function on the pandas DataFrame, this time for the red wine samples.

In [None]:
df = pd.read_csv('../data/winequality-red.csv', sep=';')
df.describe()

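The `sep=';'` argument matters because this file separates fields with semicolons rather than commas. The sketch below shows the difference on a hypothetical in-memory excerpt with made-up values, so it runs without the data file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated excerpt (made-up values)
raw = "fixed acidity;pH;quality\n7.0;3.3;5\n8.1;3.1;6\n"

wrong = pd.read_csv(io.StringIO(raw))            # default sep=',' reads each row as one field
right = pd.read_csv(io.StringIO(raw), sep=";")   # splits each row into three fields

print(wrong.shape)   # a single column
print(right.shape)   # three columns
print(right.dtypes)
```

If describe() returns far fewer columns than expected, a wrong separator is one likely cause to check.
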
### describe() returns 8 statistical properties for each attribute
* Count: how many non-missing values the attribute has
* Mean: the arithmetic mean of all values of the attribute
* Standard Deviation: how much the values vary around the mean. A low $\sigma$ means the values cluster close to the mean
* Minimum value
* 25<sup>th</sup> Percentile: the value below which 25% of the instances can be found
* 50<sup>th</sup> Percentile: the value below which 50% of the instances can be found (also called the median, the value that splits the distribution in half)
* 75<sup>th</sup> Percentile: the value below which 75% of the instances can be found
* Maximum value

### Class Distribution
Summarise the distribution of instances across classes. Here the class is the quality score.

In classification problems you need to know how balanced the class values are. Highly imbalanced
problems (many more observations for one class than for another) are common and may need special
handling in the data preparation stage of your project.

In [None]:
df.groupby('quality').size()

### Correlation between attributes
Correlation refers to the relationship between two variables and how they may or may not change together.
A correlation of -1 or 1 indicates a perfect negative or positive linear correlation respectively, whereas a value of 0 indicates no linear correlation at all.

![correlation](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0)

In [None]:
df.corr(method='pearson')

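With many attributes, the full correlation matrix can be hard to scan. One possible way to read it, sketched below on a hypothetical toy frame, is to pull out only the column for the target and sort it by absolute value. The column names mimic the wine data, but the values are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "alcohol":          [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.9, 0.7, 0.7, 0.5, 0.4, 0.3],
    "pH":               [3.3, 3.2, 3.4, 3.2, 3.4, 3.3],
    "quality":          [4, 5, 5, 6, 7, 7],
})

# Pearson correlation of every input with the target, strongest first
target_corr = (
    toy.corr(method="pearson")["quality"]
    .drop("quality")                          # the target correlates perfectly with itself
    .sort_values(key=abs, ascending=False)    # rank by strength, ignoring sign
)
print(target_corr)
```
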
### Skew of Univariate Distributions

In [None]:
df.skew()