
# Seven (7) Recipes to better understand your Machine Learning Data
You must understand your data in order to get the best results. In this notebook, you will learn 7 recipes that you can use in Python to better understand your data. By the end of this lesson, you will know how to:

## 1. Take a peek at your raw data.

Before exploring the data, you first need to load it into Google Colab. Of the available options, this notebook uses the `Uploading the data from Local File System` method.

In [1]:
from google.colab import files

uploaded = files.upload()
filename = 'pima-indians-diabetes.csv'

Saving pima-indians-diabetes.csv to pima-indians-diabetes.csv
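
The object returned by `files.upload()` is a dictionary that maps each uploaded file name to the file's raw bytes, so the data can also be read straight from memory. The following is a minimal sketch of that idea, not the notebook's own method. The `uploaded` dictionary here is a hand-made stand-in with two rows so that the example runs outside Colab, and `data_from_bytes` is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for the result of files.upload(): {file name: raw bytes}.
# Only two rows are included; this is illustrative, not the full dataset.
uploaded = {
    'pima-indians-diabetes.csv': (
        b"6,148,72,35,0,33.6,0.627,50,1\n"
        b"1,85,66,29,0,26.6,0.351,31,0\n"
    )
}

# Same column names that the notebook uses when it reads the file.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Take the first uploaded file name and wrap its bytes in a file-like object.
filename = next(iter(uploaded))
data_from_bytes = pd.read_csv(io.BytesIO(uploaded[filename]), names=names)
print(data_from_bytes.shape)
```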


Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better pre-process and handle the data for machine learning tasks.

In [2]:
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names, comment='#')  # Read the uploaded file; lines starting with '#' are skipped.
peek = data.head(20)
print(peek)

    preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118
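
`head()` only shows the start of the file. As a rough sketch, `tail()` and `sample()` give a view of the last rows and of randomly chosen rows, which can reveal ordering effects that the first rows hide. The small `demo` frame is made up for illustration, and `random_state=7` is an arbitrary seed chosen only to make the sample repeatable.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the real dataset.
demo = pd.DataFrame({'preg': range(10), 'class': [0, 1] * 5})

first_rows = demo.head(3)                       # first three rows
last_rows = demo.tail(3)                        # last three rows
random_rows = demo.sample(n=3, random_state=7)  # three rows picked at random

print(last_rows)
print(random_rows)
```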

## 2. Review the dimensions of your dataset.

You must have a very good handle on how much data you have, both in terms of rows and columns.


*   Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
*   Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.

You can review the shape and size of your dataset by printing the `shape` property on the Pandas `DataFrame`.



In [3]:
print(data.shape)

(768, 9)
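
Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked and used for a quick sanity check on how many rows are available per feature. This is a sketch under the assumption that exactly one column holds the target; the `demo` frame and the `rows_per_feature` name are illustrative only.

```python
import pandas as pd

# Hypothetical four-row frame with two features and one target column.
demo = pd.DataFrame({
    'f1': [1, 2, 3, 4],
    'f2': [0.1, 0.2, 0.3, 0.4],
    'class': [0, 1, 0, 1],
})

n_rows, n_cols = demo.shape
n_features = n_cols - 1  # assumption: exactly one column is the target
rows_per_feature = n_rows / n_features

print(f"{n_rows} rows, {n_features} features, {rows_per_feature:.1f} rows per feature")
```

Applied to the dataset above, a shape of `(768, 9)` with one class column corresponds to 8 features and 96 rows per feature.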


## 3. Review the data types of attributes in your data.

The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. You can get an idea of the types of attributes by peeking at the raw data, as above. You can list the data types used by the `DataFrame` to characterize each attribute using the `dtypes` property.

In [4]:
print(data.dtypes)

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
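
All attributes in this dataset are already numeric, so no conversion is needed here. For data that does arrive as strings, the sketch below shows one possible way to convert them: `pd.to_numeric` for numbers stored as text, and an ordered `pd.Categorical` for ordinal labels. The `raw` frame, its `risk` column, and the ordering `low < medium < high` are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data where every value was read as a string.
raw = pd.DataFrame({
    'mass': ['33.6', '26.6', 'unknown'],
    'risk': ['low', 'high', 'medium'],
})

converted = pd.DataFrame({
    # Text that cannot be parsed as a number becomes NaN instead of raising.
    'mass': pd.to_numeric(raw['mass'], errors='coerce'),
    # Ordinal labels are mapped to integer codes in the stated order.
    'risk': pd.Categorical(
        raw['risk'], categories=['low', 'medium', 'high'], ordered=True
    ).codes,
})

print(converted.dtypes)
print(converted)
```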


## 4. Summarize the distribution of instances across classes in your dataset. (Classification only)

On classification problems, you need to know how balanced the class values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project. You can quickly get an idea of the distribution of the class attribute in Pandas.

In [7]:
class_counts = data.groupby('class').size()
print(class_counts)

class
0    500
1    268
dtype: int64
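
Raw counts are easier to compare as proportions. As a small sketch, `value_counts(normalize=True)` reports the share of each class, and dividing the largest count by the smallest gives a simple imbalance ratio. The `labels` series below is rebuilt from the counts printed above (500 and 268) so that the example is self-contained; `imbalance_ratio` is a hypothetical name.

```python
import pandas as pd

# Rebuild the class column from the counts shown above.
labels = pd.Series([0] * 500 + [1] * 268, name='class')

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()      # majority count / minority count

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```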


## 5. Summarize your data using descriptive statistics.

Descriptive statistics can give you great insight into the shape of each attribute. Often you can create more summaries than you have time to review. The `describe()` function on the Pandas `DataFrame` lists 8 statistical properties of each attribute. They are:


*   Count.
*   Mean.
*   Standard Deviation.
*   Minimum Value.
*   25th Percentile.
*   50th Percentile (Median).
*   75th Percentile.
*   Maximum Value.

The `set_option()` function changes the display precision of the numbers, so that printed values are rounded to 2 decimal places. When describing your data this way, it is worth taking some time to review observations from the results. This might include the presence of `NA` values for missing data or surprising distributions for attributes.



In [8]:
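pd.set_option('display.precision', 2)  # Fully qualified option name; the short form 'precision' is ambiguous in newer pandas releases.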
print(data.describe())

         preg    plas    pres    skin    test    mass    pedi     age   class
count  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00
mean     3.85  120.89   69.11   20.54   79.80   31.99    0.47   33.24    0.35
std      3.37   31.97   19.36   15.95  115.24    7.88    0.33   11.76    0.48
min      0.00    0.00    0.00    0.00    0.00    0.00    0.08   21.00    0.00
25%      1.00   99.00   62.00    0.00    0.00   27.30    0.24   24.00    0.00
50%      3.00  117.00   72.00   23.00   30.50   32.00    0.37   29.00    0.00
75%      6.00  140.25   80.00   32.00  127.25   36.60    0.63   41.00    1.00
max     17.00  199.00  122.00   99.00  846.00   67.10    2.42   81.00    1.00
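
In the table above, the minimum of `plas`, `pres`, `skin`, `test`, and `mass` is 0. Whether those zeros stand for missing measurements is an assumption made here, not something the notebook states. Under that assumption, one way to make them visible is to replace zeros with `NaN` and count them, as sketched below using the first ten rows of the peek output.

```python
import numpy as np
import pandas as pd

# First ten rows of the peek output above, restricted to five columns.
sample = pd.DataFrame({
    'plas': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    'pres': [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    'skin': [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    'test': [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    'mass': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Assumption: a zero in these columns means "not recorded".
masked = sample.replace(0, np.nan)
missing_counts = masked.isna().sum()

print(missing_counts)
# describe() ignores NaN, so the count row now reflects recorded values only.
print(masked.describe().loc['count'])
```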


## 6. Understand the relationships in your data using correlations.

**Correlation** refers to the relationship between two variables and how they may or may not change together. The most common method for calculating correlation is **Pearson's Correlation Coefficient**, which assumes a normal distribution of the attributes involved. A correlation of `-1` or `1` shows a full negative or positive correlation respectively, whereas a value of `0` shows no correlation at all. Some machine learning algorithms, such as linear and logistic regression, can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in your dataset. You can use the `corr()` function on the Pandas `DataFrame` to calculate a correlation matrix.




**Why is it important to determine correlations between variables?**

The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison. 


**How can we deal with this problem?**


*   Eliminate one feature from each pair of highly correlated features
*   Dimensionality reduction

In [9]:
correlations = data.corr(method='pearson')
print(correlations)

       preg  plas  pres  skin  test  mass  pedi   age  class
preg   1.00  0.13  0.14 -0.08 -0.07  0.02 -0.03  0.54   0.22
plas   0.13  1.00  0.15  0.06  0.33  0.22  0.14  0.26   0.47
pres   0.14  0.15  1.00  0.21  0.09  0.28  0.04  0.24   0.07
skin  -0.08  0.06  0.21  1.00  0.44  0.39  0.18 -0.11   0.07
test  -0.07  0.33  0.09  0.44  1.00  0.20  0.19 -0.04   0.13
mass   0.02  0.22  0.28  0.39  0.20  1.00  0.14  0.04   0.29
pedi  -0.03  0.14  0.04  0.18  0.19  0.14  1.00  0.03   0.17
age    0.54  0.26  0.24 -0.11 -0.04  0.04  0.03  1.00   0.24
class  0.22  0.47  0.07  0.07  0.13  0.29  0.17  0.24   1.00
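
The first remedy listed above, eliminating one feature from a highly correlated pair, can be sketched as follows. This is one common approach rather than a definitive procedure: the `demo` frame is made up (column `b` is an exact linear function of `a`), and the `threshold` of 0.9 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is a linear function of 'a'; 'c' is unrelated noise.
a = np.arange(10, dtype=float)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + 1,
    'c': [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

threshold = 0.9  # assumed cut-off; tune for your own data

# Absolute correlations, keeping only the upper triangle so that each
# pair is inspected once and the diagonal is ignored.
corr = demo.corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

In the correlation matrix printed above, the largest off-diagonal value is 0.54 (`preg` and `age`), so a threshold of 0.9 would not remove any attribute from this dataset.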


## 7. Review the skew of the distributions of each attribute.

**Skew** describes how a distribution that is assumed to be Gaussian (normal, or bell curve) is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models. You can calculate the skew of each attribute using the `skew()` function on the Pandas `DataFrame`.

The skew result shows a positive (right) or negative (left) skew. Values closer to zero show less skew. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.




**Why is it important to determine the skewness of every variable?**


*   Linear models work on the assumption that the distributions of the independent and dependent variables are similar.
*   The tail region of skewed data acts like a set of outliers for the statistical model, and outliers adversely affect the model's performance.


**How can we deal with this problem?**


*   Power transformation
*   Log transformation
*   Exponential transformation


In [10]:
print(data.skew())

preg     0.90
plas     0.17
pres    -1.84
skin     0.11
test     2.27
mass    -0.43
pedi     1.92
age      1.13
class    0.64
dtype: float64
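
As an illustration of the log transformation listed above, the sketch below applies `np.log1p` (the logarithm of `1 + x`, which is safe for zeros) to a made-up right-skewed series and compares the skew before and after. The `values` series is hypothetical. A log transform of this kind targets positive (right) skew; it is not expected to help a left-skewed attribute such as `pres`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: each value doubles the previous one.
values = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)

original_skew = values.skew()

# log1p compresses the long right tail while preserving the order of values.
log_values = np.log1p(values)
log_skew = log_values.skew()

print(round(original_skew, 2), round(log_skew, 2))
```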
