Quantitative data is the measurement of something—whether class size, monthly
sales, or student scores. The natural way to represent these quantities is numerically
(e.g., 29 students, $529,392 in sales).

<font color='blue'>**1. Rescaling a Feature**</font>

Problem : You need to rescale the value of a numerical feature to be between two values.

In [None]:
# Use scikit-learn's MinMaxScaler to rescale a feature array

#load libraries
import numpy as np
from sklearn import preprocessing

#Create Feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

minmax_scale = preprocessing.MinMaxScaler(feature_range=(-1,1))
#scale feature
scaled_feature = minmax_scale.fit_transform(feature)

#show feature
scaled_feature

array([[-1.        ],
       [-0.42857143],
       [-0.28571429],
       [-0.14285714],
       [ 1.        ]])
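
As a rough check on what MinMaxScaler computes, the sketch below applies the usual min-max formula by hand with NumPy. The names `lo`, `hi`, and `manual_scaled` are illustrative only and are not part of scikit-learn.

```python
import numpy as np

# Same feature values as in the cell above
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Target range, matching feature_range=(-1, 1)
lo, hi = -1, 1

# Min-max formula: map the column minimum to lo and the column maximum to hi
x_min = feature.min(axis=0)
x_max = feature.max(axis=0)
manual_scaled = (feature - x_min) / (x_max - x_min) * (hi - lo) + lo

print(manual_scaled)
```

The result should agree with the `scaled_feature` array shown above, up to floating-point rounding.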

<font color='blue'>**2. Standardize a Feature**</font>

Problem : You need to transform a feature to have a mean of 0 and a standard deviation of 1.

In [None]:
#load libraries
import numpy as np
from sklearn import preprocessing

#Create feature
x = np.array([[-1000.1],
              [-200.2],
              [500.5],
              [600.6],
              [9000.9]])

scaler = preprocessing.StandardScaler()

standardized = scaler.fit_transform(x)
print(standardized)

[[-0.76058269]
 [-0.54177196]
 [-0.35009716]
 [-0.32271504]
 [ 1.97516685]]


In [None]:
#print the mean and standard deviation

print("Mean",round(standardized.mean()))
print("Standard Deviation",standardized.std())

Mean 0
Standard Deviation 1.0
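
For intuition, standardization can also be written out directly as a z-score. This is a minimal sketch, assuming the population standard deviation (NumPy's default `ddof=0`), which is what the output above is consistent with; `manual_standardized` is an illustrative name.

```python
import numpy as np

# Same feature values as in the cell above
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# z-score: subtract the column mean, then divide by the
# population standard deviation (ddof=0)
manual_standardized = (x - x.mean(axis=0)) / x.std(axis=0)

print(manual_standardized)
```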


<font color='blue'>**3. Normalizing Observations**</font>

Problem : You want to rescale the feature values of observations to have unit norm (a total length of 1).

Solution: Use Normalizer with a norm argument

In [None]:
import numpy as np
from sklearn.preprocessing import Normalizer

#create feature matrix
features = np.array([[0.5, 0.5],
                      [1.1, 3.4],
                      [1.5, 20.2],
                      [1.63, 34.4],
                      [10.9, 3.3]])

normalizer = Normalizer(norm='l2')

#transform Feature matrix 
normalizer.transform(features)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

In [None]:
#Alternatively, we can specify Manhattan Norm(L1):

feature_l1_norm = Normalizer(norm='l1').transform(features)

feature_l1_norm

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])
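
Unlike the scalers above, which work column by column, Normalizer works on each row (observation) separately. The sketch below reproduces both norms by hand with NumPy; the variable names are illustrative.

```python
import numpy as np

features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

# L2: divide each row by its Euclidean length
l2_lengths = np.linalg.norm(features, ord=2, axis=1, keepdims=True)
manual_l2 = features / l2_lengths

# L1: divide each row by the sum of its absolute values
l1_lengths = np.abs(features).sum(axis=1, keepdims=True)
manual_l1 = features / l1_lengths

# After L2 normalization every row has Euclidean length 1;
# after L1 normalization the absolute values in every row sum to 1
print(np.linalg.norm(manual_l2, axis=1))
print(np.abs(manual_l1).sum(axis=1))
```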

<font color='blue'>**4. Generating Polynomial and Interaction Features**</font>

Problem : You want to create Polynomial and Interaction Features

Solution: Even though some choose to create polynomial and interaction features manually, scikit-learn offers a built-in method:

In [None]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

features = np.array([[2,3],
                     [2,3],
                     [2,3]])

#create PolynomialFeature OBject

polynomial_interaction = PolynomialFeatures(degree=2,include_bias=False)

polynomial_interaction.fit_transform(features)

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

The `degree` parameter determines the maximum degree of the polynomial. For example, `degree=2` will create new features raised to the second power:

$x_1,x_2,x_1^2,x_2^2$

Furthermore, by default PolynomialFeatures includes interaction features:

$x_1x_2$

We can restrict the features created to only interaction features by setting `interaction_only` to `True`.

In [None]:
interaction = PolynomialFeatures(degree=2,interaction_only=True,include_bias=False)

interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])
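
To make the column order of the degree-2 output explicit, the sketch below builds the same five columns by hand and compares them with PolynomialFeatures. The `sample` array is a hypothetical input chosen so that the rows differ; it is not the feature matrix used above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Illustrative rows with distinct values so the columns are easy to tell apart
sample = np.array([[2, 3],
                   [4, 5]])
x1, x2 = sample[:, 0], sample[:, 1]

# Column order for degree=2, include_bias=False: x1, x2, x1^2, x1*x2, x2^2
manual = np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

poly = PolynomialFeatures(degree=2, include_bias=False)
generated = poly.fit_transform(sample)

# True when the hand-built columns match the generated ones
print(np.allclose(generated, manual))
```

With `interaction_only=True`, only the `x1`, `x2`, and `x1*x2` columns would remain.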

<font color='blue'>**5. Transforming Features**</font>

Problem : You want to make a custom transformation to one or more features

Solution: In scikit-learn, use FunctionTransformer to apply a function to a set of features:

In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

features = np.array([[2,3],
                     [2,3],
                     [2,3]])

#define a simple function
def add_ten(x):
  return x+10

#create transformer
ten_transformer = FunctionTransformer(add_ten)
ten_transformer.fit_transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

In [None]:
# We can create the same transformation in pandas using apply
import pandas as pd
df = pd.DataFrame(features,columns=['Feature1','Feature2'])

df.apply(add_ten)

   Feature1  Feature2
0        12        13
1        12        13
2        12        13


<font color='blue'>**6. Detecting Outliers**</font>

Problem : You want to identify extreme observations

Solution: Detecting outliers is unfortunately more of an art than a science. However, a common method is to assume the data is normally distributed and, based on that assumption, "draw" an ellipse around the data, classifying any observation inside the ellipse as an inlier (labelled as 1) and any observation outside the ellipse as an outlier (labelled as -1).

In [None]:
#load libraries 
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

#Create simulated data
features,_ = make_blobs(n_samples =10,
                        n_features = 2,
                        centers = 1,
                        random_state =1)
#Replace the first observation's values with extreme values 
features[0,0] = 1000
features[0,1] = 1000

#create detector
outlier_detector = EllipticEnvelope(contamination=.1)

#Fit Detector
outlier_detector.fit(features)
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

A major limitation of this approach is the need to specify a contamination parameter,
which is the proportion of observations that are outliers—a value that we don’t
know. Think of contamination as our estimate of the cleanliness of our data. If we
expect our data to have few outliers, we can set contamination to something small.


Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the interquartile range (IQR):

In [None]:
#create one feature
feature = features[:,0]

# create a function to return index of outliers 
def indices_of_outliers(x):
  q1,q3 = np.percentile(x,[25,75])
  iqr = q3 - q1
  lower_bound = q1 - (iqr * 1.5)
  upper_bound = q3 +( iqr * 1.5)
  return np.where((x> upper_bound)|(x<lower_bound))

In [None]:
indices_of_outliers(feature)

(array([0]),)

IQR is the difference between the first and third quartiles of a data set. You can think of IQR as the spread of the bulk of the data, with outliers being observations far from the main concentration of data. **Outliers are commonly defined as any value more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile.**
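
A small worked example may make the IQR bounds concrete. The `values` array below is a hypothetical sample with one obviously extreme entry; it is not taken from the simulated data above.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 100])

q1, q3 = np.percentile(values, [25, 75])   # 2.0 and 4.0
iqr = q3 - q1                               # 2.0
lower_bound = q1 - 1.5 * iqr                # -1.0
upper_bound = q3 + 1.5 * iqr                # 7.0

# Anything outside [lower_bound, upper_bound] is flagged
outlier_mask = (values < lower_bound) | (values > upper_bound)
print(values[outlier_mask])  # [100]
```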

<font color='blue'>**7. Handling Outliers**</font>

Problem : You have outliers.

Solution: There are three common strategies for handling outliers. First, we can drop them:

In [None]:
# Load library
import pandas as pd
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
# Filter observations
houses[houses['Bathrooms'] < 20]

    Price  Bathrooms  Square_Feet
0  534433        2.0         1500
1  392333        3.5         2500
2  293222        2.0         1500


Second, we can mark them as outliers and include that marker as a feature:

In [None]:
import numpy as np

houses['Outlier'] = np.where(houses['Bathrooms']<20,0,1)
#show data
houses

     Price  Bathrooms  Square_Feet  Outlier
0   534433        2.0         1500        0
1   392333        3.5         2500        0
2   293222        2.0         1500        0
3  4322032      116.0        48000        1


In [None]:
# Finally, we can transform the feature to dampen the effect of the
# outlier

houses['log_of_Square_feet'] = np.log(houses['Square_Feet'])

#Show Data
houses
                                

     Price  Bathrooms  Square_Feet  Outlier  log_of_Square_feet
0   534433        2.0         1500        0            7.313220
1   392333        3.5         2500        0            7.824046
2   293222        2.0         1500        0            7.313220
3  4322032      116.0        48000        1           10.778956


<font color='blue'>**8. Discretizing Features**</font>

Problem : You have a numerical feature and want to break it up into discrete bins.

In [None]:
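import numpy as np
from sklearn.preprocessing import Binarizer

age = np.array([[6],
                [12],
                [20],
                [36],
                [65]])

# Create Binarizer; the threshold of 18 is an illustrative cutoff (assumed here):
# values above it become 1 and all others become 0
binarizer1 = Binarizer(threshold=18)

# Transform the feature and show the binarized data
binarizer1.fit_transform(age)

array([[0],
       [0],
       [1],
       [1],
       [1]])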
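
When two bins are not enough, one possible approach is NumPy's `digitize`, which assigns each value the index of the bin it falls into. This is a minimal sketch; the bin edges 20, 30, and 64 are arbitrary illustrative choices.

```python
import numpy as np

age = np.array([[6], [12], [20], [36], [65]])

# Illustrative bin edges; each value is replaced by the index of its bin.
# With the default right=False, a value equal to an edge goes into the higher bin.
binned = np.digitize(age, bins=[20, 30, 64])

print(binned)
```

Here ages below 20 fall into bin 0, ages from 20 up to (but not including) 30 into bin 1, ages from 30 up to 64 into bin 2, and ages of 64 or more into bin 3.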

<font color='blue'>**9. Grouping Observations Using Clustering**</font>

Problem : You want to cluster observations so that similar observations are grouped together.

Solution : If you know that you have $k$ groups, you can use k-means clustering to group similar observations and output a new feature containing each observation's group membership:

In [None]:
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

#Make simulated feature matrix 
features,_ = make_blobs(n_samples=50,
                        n_features = 2,
                        centers = 3,
                        random_state =1)
#create Dataframe
df = pd.DataFrame(features,columns=['Feature1','Feature2'])

#Make k-means clusters
clusterer = KMeans(3,random_state=0)
clusterer.fit(features)

#Predict values 
df['group'] = clusterer.predict(features)
df.head(5)

   Feature1  Feature2  group
0 -9.877554 -3.336145      0
1 -7.287210 -8.353986      2
2 -6.943061 -7.023744      2
3 -7.440167 -8.791959      2
4 -6.641388 -8.075888      2


<font color='blue'>**10. Deleting Observations with Missing Values**</font>

Problem : You want to delete observations containing missing values.

Solution : Deleting observations with missing values is easy with a clever line in NumPy:

In [None]:
# Load libraries
import numpy as np
import pandas as pd

# Create feature matrix
features = np.array([[1.1, 11.1],
                    [2.2, 22.2],
                    [3.3, 33.3],
                    [4.4, 44.4],
                    [np.nan, 55]])

# Keep only the observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

In [None]:
#Alternatively, we can drop missing observations using Pandas

df = pd.DataFrame(features,columns=[1,2])
df.dropna()

     1     2
0  1.1  11.1
1  2.2  22.2
2  3.3  33.3
3  4.4  44.4


<font color='blue'>**11. Imputing Missing Values**</font>

Problem : You have missing values in your data and want to fill in or predict their value.

Solution : If you have a small amount of data, predict the missing values using k-nearest neighbours (KNN):

In [None]:
import numpy as np
!pip install fancyimpute
from fancyimpute import KNN


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000,
                        n_features = 2,
                        random_state = 1)

#Standardize the features
scaler = StandardScaler()
standardized_Features = scaler.fit_transform(features)

# Replace the first feature's first value with a missing value
true_value = standardized_Features[0,0]
standardized_Features[0,0] = np.nan


#Predict the missing value in the feature matrix
knn_imputer = KNN()
standardized_Features[:,:] = knn_imputer.fit_transform(standardized_Features)

#Compare the true and Imputed values 
print("True Value",true_value)
print("Imputed Value",standardized_Features[0,0])

Imputing row 1/1000 with 1 missing, elapsed time: 0.211
Imputing row 101/1000 with 0 missing, elapsed time: 0.212
Imputing row 201/1000 with 0 missing, elapsed time: 0.213
Imputing row 301/1000 with 0 missing, elapsed time: 0.214
Imputing row 401/1000 with 0 missing, elapsed time: 0.215
Imputing row 501/1000 with 0 missing, elapsed time: 0.216
Imputing row 601/1000 with 0 missing, elapsed time: 0.217
Imputing row 701/1000 with 0 missing, elapsed time: 0.218
Imputing row 801/1000 with 0 missing, elapsed time: 0.219
Imputing row 901/1000 with 0 missing, elapsed time: 0.220
True Value 0.8730186113995938
Imputed Value 1.0955332713113226
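
If installing fancyimpute is not an option, a similar experiment can be sketched with scikit-learn's own `KNNImputer`. This swaps in a different implementation from the one used above, so the imputed value may differ; `n_neighbors=5` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Make a simulated feature matrix and standardize it
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)
standardized = StandardScaler().fit_transform(features)

# Replace the first feature's first value with a missing value
true_value = standardized[0, 0]
standardized[0, 0] = np.nan

# Fill the gap from the nearest rows; n_neighbors=5 is an assumed setting
imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(standardized)

# Compare the true and imputed values
print("True Value", true_value)
print("Imputed Value", imputed[0, 0])
```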
