# Datasets and summary statistics

First, we import the necessary libraries:

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.datasets as dsets

Next, we load the data:

In [3]:
# First, we load one of the datasets from the sklearn.datasets package
dataset = dsets.load_wine()

# The description tells us a lot already, but this will not always be available of course
print(dataset['DESCR'])

# Let's create a dataframe for the independent variables (the features)
wine_data = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])

# Now, we create a dataframe for the dependent variable (the class label)
wine_classes = pd.DataFrame(data=dataset['target'], columns=['target'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

## Summary statistics

Let's look into the summary statistics of some variables:

In [4]:
# Calculating the mean and median is built into both NumPy and pandas
# You can simply use the following NumPy functions (notice how round() is used)

mean = np.mean(wine_data['color_intensity'])
median = np.median(wine_data['color_intensity'])
print('Mean: '+str(round(mean,2)))
print('Median: '+str(round(median,2)))

# Or the following ones:
mean = wine_data['color_intensity'].mean()
median = wine_data['color_intensity'].median()
std = wine_data['color_intensity'].std()

print('Mean: '+str(round(mean,2)))
print('Median: '+str(round(median,2)))
print('Standard deviation: '+str(round(std,2)))

Mean: 5.06
Median: 4.69
Mean: 5.06
Median: 4.69
Standard deviation: 2.32
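
The standard deviation above comes from the pandas `.std()` method, which by default computes the sample standard deviation (`ddof=1`). NumPy's `np.std()` defaults to the population version (`ddof=0`), so the two can disagree slightly on the same data. The sketch below illustrates the difference on a handful of made-up values (the `values` series is hypothetical and not part of the wine dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the wine dataset
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# pandas default: sample standard deviation (divides by n - 1)
sample_std = values.std()

# NumPy default: population standard deviation (divides by n)
population_std = np.std(values.to_numpy())

# Passing ddof=1 makes NumPy match the pandas default
matched_std = np.std(values.to_numpy(), ddof=1)

print('pandas std (ddof=1): ' + str(round(sample_std, 4)))
print('NumPy std (ddof=0): ' + str(round(population_std, 4)))
print('NumPy std (ddof=1): ' + str(round(matched_std, 4)))
```

With 178 observations the gap is small, but it is worth knowing which definition a reported number uses.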


In [5]:
# There is also the built-in describe() method, which works on a single column as well as on a whole dataframe
print(wine_data['color_intensity'].describe())

count    178.000000
mean       5.058090
std        2.318286
min        1.280000
25%        3.220000
50%        4.690000
75%        6.200000
max       13.000000
Name: color_intensity, dtype: float64
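
The `25%`, `50%` and `75%` rows reported by `describe()` are the quartiles, and they can also be requested directly with `quantile()`. As a minimal sketch, assuming a small made-up series (`values` is hypothetical, not wine data), the quartiles and the interquartile range could be computed like this:

```python
import pandas as pd

# Hypothetical toy values, not taken from the wine dataset
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Request the three quartiles in one call
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# The interquartile range is the spread of the middle half of the data
iqr = q3 - q1

# describe() reports the same quartiles under the labels 25%, 50% and 75%
summary = values.describe()

print('Q1: ' + str(q1))
print('Median: ' + str(q2))
print('Q3: ' + str(q3))
print('IQR: ' + str(iqr))
```

In the notebook itself, the same calls should work on `wine_data['color_intensity']`.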


In [6]:
print(wine_data.describe())

          alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  \
count  178.000000  178.000000  178.000000         178.000000  178.000000   
mean    13.000618    2.336348    2.366517          19.494944   99.741573   
std      0.811827    1.117146    0.274344           3.339564   14.282484   
min     11.030000    0.740000    1.360000          10.600000   70.000000   
25%     12.362500    1.602500    2.210000          17.200000   88.000000   
50%     13.050000    1.865000    2.360000          19.500000   98.000000   
75%     13.677500    3.082500    2.557500          21.500000  107.000000   
max     14.830000    5.800000    3.230000          30.000000  162.000000   

       total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
count     178.000000  178.000000            178.000000       178.000000   
mean        2.295112    2.029270              0.361854         1.590899   
std         0.625851    0.998859              0.124453         0.572359   
min         0.9
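
When a dataframe has many columns, the output of `describe()` wraps across several blocks, as seen above. One option is to transpose the result so that each variable gets its own row. The sketch below uses a small hypothetical dataframe (`toy`, with invented column names) to show the idea:

```python
import pandas as pd

# Hypothetical toy dataframe, not the wine dataset
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0],
    'feature_b': [10.0, 20.0, 30.0, 40.0],
})

# Transpose so that rows are variables and columns are statistics
summary = toy.describe().T
print(summary.round(2))

# Selecting a few statistics keeps the table narrow
print(summary[['mean', 'std', 'min', 'max']].round(2))
```

Applied to the notebook's data, this would be `wine_data.describe().T`, giving one row per variable.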

For categorical variables, such as the class label, we use frequency tables instead:

In [7]:
print(wine_classes['target'].value_counts())

1    71
0    59
2    48
Name: target, dtype: int64
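
Absolute counts are often easier to compare as proportions. `value_counts()` accepts `normalize=True` for this, and `sort_index()` orders the table by class label rather than by frequency. The sketch below rebuilds a stand-in label series from the counts printed above (the `labels` series is an assumption made so the example runs on its own; in the notebook you would use `wine_classes['target']` directly):

```python
import pandas as pd

# Stand-in labels rebuilt from the counts printed above (59, 71, 48)
labels = pd.Series([0] * 59 + [1] * 71 + [2] * 48, name='target')

# Absolute frequencies, ordered by class label
counts = labels.value_counts().sort_index()

# Relative frequencies (they sum to 1)
shares = labels.value_counts(normalize=True).sort_index()

# Combine both into a single frequency table
freq_table = pd.DataFrame({'count': counts, 'share': shares.round(3)})
print(freq_table)
```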
