<h2><font color="#004D7F" size=6>Module 2. Data Analysis</font></h2>

<h1><font color="#004D7F" size=5>2. Descriptive Statistics</font></h1>

<h2><font color="#004D7F" size=5>Index</font></h2>
<a id="indice"></a>

* 1. Introduction
    * 1.1. Load the Dataset
* 2. Descriptive Statistics Functions
    * 2.1. Reviewing the Data: head()
    * 2.2. Data Dimensions: shape
    * 2.3. Data Types: dtypes
    * 2.4. Summary: describe()
    * 2.5. Class Distribution: groupby('class').size()
    * 2.6. Correlations: corr()
    * 2.7. Skewness: skew()

# <font color="#004D7F"> 1. Introduction</font>

A common mistake when starting out in machine learning is to make decisions straight from algorithm results without first analyzing the dataset. Analyzing and preprocessing the data before modeling not only leads to better and more reliable results, it also helps us understand the dataset and builds experience in the data analysis and processing phase.

## <font color="#004D7F">1.1. Load the Dataset</font>

For this exercise, we load the Pima Indians Diabetes dataset and examine it with descriptive statistics functions.

In [2]:
import pandas as pd

filename = 'data/pima-indians-diabetes.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']

data = pd.read_csv(filename, names=names)
data

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,627.00,50,1
1,1,85,66,29,0,26.6,351.00,31,0
2,8,183,64,0,0,23.3,672.00,32,1
3,1,89,66,23,94,28.1,167.00,21,0
4,0,137,40,35,168,43.1,2288.00,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,171.00,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,245.00,30,0
766,1,126,60,0,0,30.1,349.00,47,1
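
As a minimal sketch of what the `names` argument does, the snippet below reads a two-row excerpt from an in-memory string instead of the real file. It assumes the CSV has no header row; the `csv_text` string and the `sample` variable are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Illustrative two-row excerpt standing in for a headerless CSV file.
csv_text = (
    "6,148,72,35,0,33.6,627.0,50,1\n"
    "1,85,66,29,0,26.6,351.0,31,0\n"
)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Passing names= tells pandas to treat every line as data
# and to label the columns with the given list.
sample = pd.read_csv(io.StringIO(csv_text), names=names)

print(sample.columns.tolist())
print(len(sample))
```

If the file did contain a header row, passing `names` this way would load that header as an ordinary data row, so it is worth checking the first lines of the file before loading.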


# <font color="#004D7F"> 2. Descriptive Statistics Functions</font>

## <font color="#004D7F">2.1. Reviewing the Data: head() </font>

You can review the first 20 rows of your data using the `head()` method of the pandas DataFrame. The first column of the output is the row index, which is useful for referencing a specific observation.

In [3]:
data.head(20)

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,627.0,50,1
1,1,85,66,29,0,26.6,351.0,31,0
2,8,183,64,0,0,23.3,672.0,32,1
3,1,89,66,23,94,28.1,167.0,21,0
4,0,137,40,35,168,43.1,2288.0,33,1
5,5,116,74,0,0,25.6,201.0,30,0
6,3,78,50,32,88,31.0,248.0,26,1
7,10,115,0,0,0,35.3,134.0,29,0
8,2,197,70,45,543,30.5,158.0,53,1
9,8,125,96,0,0,0.0,232.0,54,1
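
The following sketch shows how the argument to `head()` controls the number of rows returned, together with the companion `tail()` method. The 100-row frame `df` is a made-up stand-in, not the diabetes dataset.

```python
import pandas as pd

# Made-up frame with 100 rows, used only to show row selection.
df = pd.DataFrame({'x': range(100)})

first_default = df.head()   # no argument: the first 5 rows
first_twenty = df.head(20)  # the first 20 rows
last_three = df.tail(3)     # the last 3 rows

print(len(first_default), len(first_twenty))
print(last_three.index.tolist())
```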


## <font color="#004D7F">2.2. Data Dimensions: shape </font>

You can review the shape and size of your dataset by printing the `shape` property of the pandas DataFrame. The result is a tuple giving the number of rows, then the number of columns. Here the dataset has 768 rows and 9 columns.

In [5]:
# Dimensions of your data
data.shape

(768, 9)
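
Because `shape` is a plain tuple, it can be unpacked into separate variables. The sketch below does this on a three-row sample built from the first rows shown earlier; the names `sample`, `n_rows`, and `n_cols` are illustrative.

```python
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Three-row sample copied from the first rows of the dataset output.
sample = pd.DataFrame(
    [
        [6, 148, 72, 35, 0, 33.6, 627.0, 50, 1],
        [1, 85, 66, 29, 0, 26.6, 351.0, 31, 0],
        [8, 183, 64, 0, 0, 23.3, 672.0, 32, 1],
    ],
    columns=names,
)

# shape is (rows, columns), so tuple unpacking gives both counts.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```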

## <font color="#004D7F">2.3. Data Types: dtypes </font>

You can list the data type of each attribute using the `dtypes` property of the DataFrame. You can see that most attributes are integers and that `mass` and `pedi` are floating-point types.

In [6]:
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
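
Once the types are known, columns can be selected by type with `select_dtypes()`. This is a small sketch on a four-column sample taken from the rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Four columns from the first three rows of the dataset output.
sample = pd.DataFrame({
    'preg': [6, 1, 8],
    'mass': [33.6, 26.6, 23.3],
    'pedi': [627.0, 351.0, 672.0],
    'class': [1, 0, 1],
})

# Split the column names by the dtypes reported above.
float_cols = sample.select_dtypes(include='float64').columns.tolist()
int_cols = sample.select_dtypes(include='int64').columns.tolist()

print(float_cols)
print(int_cols)
```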

## <font color="#004D7F">2.4. Summary: describe() </font>

The `describe()` method lists eight statistical properties of each attribute: count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum. The calls to `pd.set_option()` change the precision of the numbers and the preferred width of the output to make it more readable for this example. This summary contains a lot of information, so it is worth taking some time to review the results and note any observations.

In [13]:
# Statistical Summary
pd.set_option('display.width',100)
pd.set_option('display.precision', 3)

data.describe()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845,120.895,69.105,20.536,79.799,31.993,428.235,33.241,0.349
std,3.37,31.973,19.356,15.952,115.244,7.884,340.486,11.76,0.477
min,0.0,0.0,0.0,0.0,0.0,0.0,0.1,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,205.0,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,337.0,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,591.5,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2329.0,81.0,1.0
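
The result of `describe()` is itself a DataFrame, so individual statistics can be read from it by label. The sketch below compares the mean and the median of two columns on a five-row sample taken from the rows shown earlier. The numbers from such a small sample are only illustrative.

```python
import pandas as pd

# Two columns from the first five rows of the dataset output.
sample = pd.DataFrame({
    'plas': [148, 85, 183, 89, 137],
    'test': [0, 0, 0, 94, 168],
})

summary = sample.describe()

# Rows of the summary are indexed by statistic name, columns by attribute.
mean_plas = summary.loc['mean', 'plas']
median_plas = summary.loc['50%', 'plas']

# A mean well above the median hints at a right-skewed attribute.
mean_test = summary.loc['mean', 'test']
median_test = summary.loc['50%', 'test']

print(summary.loc[['mean', '50%']])
```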


## <font color="#004D7F">2.5. Class Distribution: groupby('class').size() </font>

In classification problems, you need to know how balanced the class values are. Grouping by the `class` column and taking the size of each group shows that there are almost twice as many observations with class 0 (no diabetes occurrence) as with class 1 (diabetes occurrence). The classes are therefore imbalanced, so we must interpret the results of any algorithm very carefully.

In [18]:
# Class Distribution
# Classification problem

# This line of code groups the data by the 'class' column and then calculates the size of each group.
data.groupby('class').size()

class
0    500
1    268
dtype: int64
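
To express the imbalance as proportions rather than raw counts, one option is `value_counts(normalize=True)`. The sketch below rebuilds only the label column from the counts shown above (500 and 268); the `labels` frame is an illustrative stand-in for the full dataset.

```python
import pandas as pd

# Label column rebuilt from the class counts shown above.
labels = pd.DataFrame({'class': [0] * 500 + [1] * 268})

counts = labels.groupby('class').size()

# Fraction of observations in each class, ordered by class label.
proportions = labels['class'].value_counts(normalize=True).sort_index()

# How many class 0 observations there are per class 1 observation.
ratio = counts.loc[0] / counts.loc[1]

print(proportions.round(3))
print(round(ratio, 2))
```

The proportion of class 1 is the same number as the mean of the `class` column in the `describe()` summary, since the labels are 0 and 1.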

## <font color="#004D7F">2.6. Correlations: corr() </font>

You can use the `corr()` method to calculate a correlation matrix. With `method='pearson'`, each coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation. The matrix lists all attributes across the top and down the side, so the correlation of each pair appears twice because the matrix is symmetric. The diagonal from the upper-left to the lower-right corner shows the perfect correlation of each attribute with itself.

In [20]:
# Pairwise Pearson correlation
pd.set_option('display.width',100)
pd.set_option('display.precision', 3)

correlation = data.corr(method='pearson')
print(correlation)

        preg   plas   pres   skin   test   mass   pedi    age  class
preg   1.000  0.129  0.141 -0.082 -0.074  0.018 -0.026  0.544  0.222
plas   0.129  1.000  0.153  0.057  0.331  0.221  0.133  0.264  0.467
pres   0.141  0.153  1.000  0.207  0.089  0.282  0.051  0.240  0.065
skin  -0.082  0.057  0.207  1.000  0.437  0.393  0.154 -0.114  0.075
test  -0.074  0.331  0.089  0.437  1.000  0.198  0.185 -0.042  0.131
mass   0.018  0.221  0.282  0.393  0.198  1.000  0.104  0.036  0.293
pedi  -0.026  0.133  0.051  0.154  0.185  0.104  1.000  0.018  0.177
age    0.544  0.264  0.240 -0.114 -0.042  0.036  0.018  1.000  0.238
class  0.222  0.467  0.065  0.075  0.131  0.293  0.177  0.238  1.000
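
A common follow-up is to rank the attributes by their correlation with the target column. The sketch below does this on a five-row sample taken from the rows shown earlier, so the coefficients it prints are not meaningful. It only illustrates the pattern, and the names `sample` and `ranking` are illustrative.

```python
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# First five rows of the dataset output; far too few for real conclusions.
sample = pd.DataFrame(
    [
        [6, 148, 72, 35, 0, 33.6, 627.0, 50, 1],
        [1, 85, 66, 29, 0, 26.6, 351.0, 31, 0],
        [8, 183, 64, 0, 0, 23.3, 672.0, 32, 1],
        [1, 89, 66, 23, 94, 28.1, 167.0, 21, 0],
        [0, 137, 40, 35, 168, 43.1, 2288.0, 33, 1],
    ],
    columns=names,
)

correlation = sample.corr(method='pearson')

# Take the 'class' column, drop its self-correlation, sort strongest first.
ranking = correlation['class'].drop('class').sort_values(ascending=False)
print(ranking)
```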


## <font color="#004D7F">2.7. Skewness: skew() </font>

You can calculate the skewness of each attribute using the `skew()` method. A positive value indicates a right (positive) skew and a negative value indicates a left (negative) skew. Values closer to zero indicate less skewness.

In [21]:
# Skew for each attribute
data.skew()

preg     0.902
plas     0.174
pres    -1.844
skin     0.109
test     2.272
mass    -0.429
pedi     1.562
age      1.130
class    0.635
dtype: float64
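
To make the sign convention concrete, the sketch below computes `skew()` on three small made-up columns: one with a long right tail, one symmetric, and one with a long left tail. The values are invented for illustration and are unrelated to the diabetes dataset.

```python
import pandas as pd

# Made-up columns chosen to show each direction of skew.
shapes = pd.DataFrame({
    'right_tail': [1, 1, 1, 2, 10],   # a few large values pull the tail right
    'symmetric': [1, 2, 3, 4, 5],     # evenly spread around the mean
    'left_tail': [1, 9, 10, 10, 10],  # a few small values pull the tail left
})

skewness = shapes.skew()
print(skewness)
```

The first column yields a positive value, the second a value of zero, and the third a negative value.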