# <font color="blue">Lesson 2 - Data Retrieval and Preparation</font>

## Binning, Scaling and Normalization

### Binning a Continuous Variable

Let's say we have a dataset that contains subject ages. Rather than analyze the ages individually, we would like to group them into the following bins:
- Youth: over 18, up to 25
- YoungAdult: over 25, up to 35
- MiddleAge: over 35, up to 60
- Senior: over 60, up to 100

We can use the pandas cut function to separate this list into bins. Its main parameters are:

- x = list (or other 1-D array-like) to cut
- bins = either an integer number of equal-width bins to create, or a sequence of bin edges (as used below)
- right = indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1,2,3,4] indicate (1,2], (2,3], (3,4]
- labels = optional list of labels, one for each bin

Let's first create the data we need: 

In [1]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']

Now we can use the cut function to cut our list into labeled bins: 

In [2]:
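import pandas as pd

# Assign each age to a labeled bin; bins include their right edge by default
new_cats = pd.cut(ages, bins, labels=bin_names)

# Count the ages in each bin (newer pandas versions deprecate pd.value_counts
# in favor of pd.Series(new_cats).value_counts())
pd.value_counts(new_cats)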

Youth         5
MiddleAge     3
YoungAdult    3
Senior        1
dtype: int64
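
The effect of the right and bins parameters is easier to see on a boundary value. The following is a minimal sketch that reuses the same ages, bins and labels; the variable names right_closed, left_closed and equal_width are illustrative and not part of the original notebook.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']

# Default (right=True): intervals are (18, 25], (25, 35], ... so age 25 is 'Youth'
right_closed = pd.cut(ages, bins, labels=bin_names)

# right=False: intervals are [18, 25), [25, 35), ... so age 25 is 'YoungAdult'
left_closed = pd.cut(ages, bins, labels=bin_names, right=False)

# Compare the label given to the third age (25) under each setting
print(right_closed[2], left_closed[2])

# Passing an integer instead of a list of edges asks for that many equal-width bins
equal_width = pd.cut(ages, 4)
print(equal_width.categories)
```

Only values that sit exactly on a bin edge change label when right is flipped, so it is worth checking which side of the edge such values should belong to before binning.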

### Z-score Normalization

Z-score scaling (standardization) transforms each feature so that it has a mean of zero and a standard deviation of 1. It shifts and rescales the values but does not change the shape of their distribution.

We can use scikit-learn's preprocessing.scale function to standardize an input array:

In [9]:
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the Iris dataset
iris = load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        ...

In [4]:
# separate the data and target attributes
X = iris.data
y = iris.target

# standardize the data attributes and cast to dataframe
standardized_X = pd.DataFrame(preprocessing.scale(X))
standardized_X.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |
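
To make the transformation concrete, the z-score of a value x in a column with mean μ and standard deviation σ is (x - μ) / σ. The sketch below recomputes it by hand with NumPy and compares the result with preprocessing.scale. It assumes the population standard deviation (NumPy's default, ddof=0); the names manual and scaled are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X = load_iris().data

# z = (x - column mean) / column standard deviation, computed per feature
# np.std uses ddof=0 (population standard deviation) by default
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = preprocessing.scale(X)

# The hand-computed values should agree with scikit-learn's result
print(np.allclose(manual, scaled))

# After scaling, each column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```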


### MinMax Scaling: Scale to between 0 and 1

In [5]:
min_max_scaler = preprocessing.MinMaxScaler()

# Copy the iris data for min-max scaling
data_minmax = X.copy()

# Scale on the copied data and display the first 5 rows
data_minmax = pd.DataFrame(min_max_scaler.fit_transform(data_minmax))
data_minmax.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.222222 | 0.625 | 0.067797 | 0.041667 |
| 1 | 0.166667 | 0.416667 | 0.067797 | 0.041667 |
| 2 | 0.111111 | 0.5 | 0.050847 | 0.041667 |
| 3 | 0.083333 | 0.458333 | 0.084746 | 0.041667 |
| 4 | 0.194444 | 0.666667 | 0.067797 | 0.041667 |
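
Min-max scaling maps each feature onto [0, 1] by subtracting the column minimum and dividing by the column range, (x - min) / (max - min). As a rough check, the sketch below applies that formula directly and compares it with MinMaxScaler using its default feature range; the names manual and scaled are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X = load_iris().data

# (x - column min) / (column max - column min), computed per feature
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

scaled = preprocessing.MinMaxScaler().fit_transform(X)

# The hand-computed values should agree with MinMaxScaler
print(np.allclose(manual, scaled))

# Every column now spans exactly 0 to 1
print(scaled.min(axis=0), scaled.max(axis=0))
```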


### Normalizing Data
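We can also use sklearn's normalize function to rescale each sample (row) to unit norm (the L2 norm by default), rather than rescaling each feature (column) as the scaling methods above do. Because the Iris measurements are all non-negative, the normalized values fall between 0 and 1: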

In [6]:
# Normalize the data attributes for the Iris dataset.

# Copy the iris data for normalizing
X2 = X.copy()

# normalize the data attributes and display the first 5 rows
normalized_X = pd.DataFrame(preprocessing.normalize(X2))
normalized_X.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.803773 | 0.551609 | 0.220644 | 0.031521 |
| 1 | 0.828133 | 0.50702 | 0.236609 | 0.033801 |
| 2 | 0.805333 | 0.548312 | 0.222752 | 0.034269 |
| 3 | 0.80003 | 0.539151 | 0.260879 | 0.034784 |
| 4 | 0.790965 | 0.569495 | 0.22147 | 0.031639 |
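
To see what normalizing on a sample basis means, the sketch below divides each row by its own Euclidean (L2) length and compares the result with preprocessing.normalize. It assumes the default norm='l2'; the names row_norms, manual and normalized are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X = load_iris().data

# Length of each row vector, kept as a column so it broadcasts across features
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

normalized = preprocessing.normalize(X)  # norm='l2' is the default

# The hand-computed values should agree with scikit-learn's result
print(np.allclose(manual, normalized))

# Every normalized row has length 1
print(np.linalg.norm(normalized, axis=1)[:5])
```

Unlike the scaling methods, this operation uses only the values within a single row, so adding or removing other samples does not change a row's normalized values.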
