### Rescaling a Feature
You need to rescale the values of a numerical feature to be between two values.

In [3]:
import numpy as np
from sklearn import preprocessing

In [8]:
# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

# Show feature
scaled_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

scikit-learn’s MinMaxScaler offers two ways to rescale a feature. One is to use fit to
calculate the minimum and maximum values of the feature, then use transform to
rescale the feature. The other is to use fit_transform to do both operations at once.
There is no mathematical difference between the two, but keeping the operations
separate is sometimes practical because it lets us apply the same transformation to
different sets of the data.
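
As a minimal sketch of keeping fit and transform separate, the example below learns the minimum and maximum from one set and reuses them on another. The train/test split and the test values are assumptions made up for illustration; they are not part of the recipe above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values (the test values are invented here)
train_feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])
test_feature = np.array([[200.2], [1000.0]])

# Learn the minimum and maximum from the training data only
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
minmax_scaler.fit(train_feature)

# Apply the same learned transformation to both sets
scaled_train = minmax_scaler.transform(train_feature)
scaled_test = minmax_scaler.transform(test_feature)
```

Note that test values outside the training range land outside the 0 to 1 interval, because the scaler keeps using the training minimum and maximum.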


### Standardizing a Feature
You want to transform a feature to have a mean of 0 and a standard deviation of 1.

In [12]:
x = np.array([[-1000.1],
             [-200.2],
             [500.5],
             [600.6],
             [9000.9]])

#Create Scaler
scaler = preprocessing.StandardScaler()

#Transform the feature
standardized = scaler.fit_transform(x)

#Show feature
standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])
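
One way to check the result is to compute the standardization by hand and confirm the mean and standard deviation. The sketch below assumes the same x as above; the variable names manual, mean_after, and std_after are introduced only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])
standardized = StandardScaler().fit_transform(x)

# Manual version: subtract the mean, divide by the population
# standard deviation (NumPy's default, ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# After standardizing, the mean should be (numerically) 0 and the std 1
mean_after = standardized.mean()
std_after = standardized.std()
```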

If our data has significant outliers, they can distort the standardization by
affecting the feature’s mean and variance. In this scenario, it is often helpful to instead
rescale the feature using the median and interquartile range. In scikit-learn, we do this
using the RobustScaler class:

In [14]:
# Create Scaler
robust_scaler = preprocessing.RobustScaler()

#Transform feature
robust_scaler.fit_transform(x)


array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
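
To see what this scaling does, the following sketch reproduces it by hand: subtract the median, then divide by the interquartile range. It assumes RobustScaler's default settings (centering on the median and scaling by the 25th to 75th percentile range); the helper names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Median and interquartile range of the feature
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# Manual robust scaling
manual_robust = (x - median) / (q3 - q1)

# scikit-learn's result with default settings, for comparison
robust = RobustScaler().fit_transform(x)
```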

### Normalizing Observations
You want to rescale the feature values of observations to have unit norm (a total
length of 1).

Normalizer provides three norm options (l1, l2, and max), with the Euclidean norm
(often called L2) being the default:

In [17]:
from sklearn.preprocessing import Normalizer

#Create feature matrix
feature = np.array([[0.5, 0.5],
                   [1.1, 3.4],
                   [1.5, 20.2],
                   [1.63, 34.4],
                   [10.9, 3.3]])

#Create Normalizer
normalizer = Normalizer(norm='l2')

#Transform feature matrix
normalizer.transform(feature)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

Alternatively, we can specify Manhattan norm (L1):

In [20]:
feature_l1_norm = Normalizer(norm='l1').transform(feature)

#Show features 
feature_l1_norm


array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])
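
The difference between the two norms can be checked directly. In this sketch (variable names are illustrative), each L2-normalized observation should have a Euclidean length of 1, while the values of each L1-normalized observation should sum to 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

feature = np.array([[0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]])

l2_normalized = Normalizer(norm='l2').transform(feature)
l1_normalized = Normalizer(norm='l1').transform(feature)

# Euclidean length of each L2-normalized observation
l2_lengths = np.linalg.norm(l2_normalized, axis=1)

# Sum of absolute values of each L1-normalized observation
l1_sums = np.abs(l1_normalized).sum(axis=1)

# The L2 case by hand: divide each row by its own Euclidean length
manual_l2 = feature / np.linalg.norm(feature, axis=1, keepdims=True)
```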

### Generating Polynomial and Interaction Features
You want to create polynomial and interaction features.

In [22]:
from sklearn.preprocessing import PolynomialFeatures

#Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

#Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)

#Create polynomial features
polynomial_interaction.fit_transform(features)


array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])
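
The output columns above are x1, x2, x1², x1·x2, and x2². If only the interaction term is wanted, PolynomialFeatures accepts an interaction_only parameter. The sketch below assumes the same feature matrix and drops the squared terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Keep the original features and their product, but no squared terms
interaction = PolynomialFeatures(degree=2,
                                 interaction_only=True,
                                 include_bias=False)
interaction_features = interaction.fit_transform(features)
# Each row is now [x1, x2, x1 * x2]
```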

### Transforming Features
You want to make a custom transformation to one or more features.

In [25]:
from sklearn.preprocessing import FunctionTransformer

#Create Feature Matrix
features = np.array([[2,3],
                    [2,3],
                    [2,3]])

#Define a simple function
def add_ten(x):
    return x+10

#Create Transformer
ten_transformer = FunctionTransformer(add_ten)

#Transform feature matrix
ten_transformer.transform(features)




array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using apply:

In [27]:
import pandas as pd

#Create Dataframe
df=pd.DataFrame(features, columns=['feature1', 'feature2'])

#apply function
df.apply(add_ten)

|   | feature1 | feature2 |
|---|---|---|
| 0 | 12 | 13 |
| 1 | 12 | 13 |
| 2 | 12 | 13 |


### Detecting Outliers
You want to identify extreme observations.

Detecting outliers is unfortunately more of an art than a science. However, a common
method is to assume the data is normally distributed and based on that assumption
“draw” an ellipse around the data, classifying any observation inside the ellipse as an
inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as
-1):

In [30]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

#Create simulated data
features, _= make_blobs(n_samples=10,
                       n_features=2,
                       centers=1,
                       random_state=1)

#Replace the first observation's values with extreme values
features[0,0]=10000
features[0,1]=10000

#Create detector
outlier_detector=EllipticEnvelope(contamination=.1)

#fit detector
outlier_detector.fit(features)

#predict outliers
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

The contamination parameter is the proportion of observations we expect to be outliers.
If we expect our data to have few outliers, we can set contamination to something small.
However, if we believe that the data is very likely to have outliers, we can set it to a
higher value.
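
As a rough illustration of how contamination changes the result, the sketch below fits the detector twice on the same simulated data and counts the flagged observations. The value 0.3 and the random_state passed to EllipticEnvelope are choices made here for illustration, not settings from the recipe above.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Same simulated data as above, with one extreme observation
features, _ = make_blobs(n_samples=10,
                         n_features=2,
                         centers=1,
                         random_state=1)
features[0, 0] = 10000
features[0, 1] = 10000

# Count how many observations are labeled -1 for each contamination value
flagged_counts = {}
for contamination in (0.1, 0.3):
    detector = EllipticEnvelope(contamination=contamination, random_state=0)
    predictions = detector.fit_predict(features)
    flagged_counts[contamination] = int((predictions == -1).sum())
```

A higher contamination value moves the decision threshold so that a larger share of observations falls outside the ellipse.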

### Handling Outliers
Typically we have three strategies we can use to handle outliers. First, we can drop them:

In [33]:
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

#Filter observations
houses[houses['Bathrooms']<20]

|   | Price | Bathrooms | Square_Feet |
|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |


Second, we can mark them as outliers and include that marker as a feature:


In [35]:
houses['Outlier']=np.where(houses['Bathrooms']<20,0,1)
#show data
houses

|   | Price | Bathrooms | Square_Feet | Outlier |
|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |


Finally, we can transform the feature to dampen the effect of the outlier:

In [37]:
#log feature
houses['Log_Of_Square_Feet'] = np.log(houses['Square_Feet'])

#Show data
houses

|   | Price | Bathrooms | Square_Feet | Outlier | Log_Of_Square_Feet |
|---|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 | 7.313220 |
| 1 | 392333 | 3.5 | 2500 | 0 | 7.824046 |
| 2 | 293222 | 2.0 | 1500 | 0 | 7.313220 |
| 3 | 4322032 | 116.0 | 48000 | 1 | 10.778956 |


### Discretizing Features
You have a numerical feature and want to break it up into discrete bins.

Depending on how we want to break up the data, there are two techniques we can
use. First, we can binarize the feature according to some threshold:

In [41]:
from sklearn.preprocessing import Binarizer

#Create feature
age=np.array([[6],
             [12],
             [20],
             [36],
             [65]])

#Create binarizer
binarizer = Binarizer(threshold=18)

#Transform feature
binarizer.fit_transform(age)


array([[0],
       [0],
       [1],
       [1],
       [1]])

Second, we can break up numerical features according to multiple thresholds:

In [43]:
#Bin feature
np.digitize(age, bins=[20,30,64])


array([[0],
       [0],
       [1],
       [2],
       [3]], dtype=int64)
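
In the output above, the value 20 falls into bin 1, because by default each bin includes its left edge and excludes its right edge. NumPy's digitize also takes a right parameter that flips this. A short sketch comparing the two, using the same age feature (variable names are illustrative):

```python
import numpy as np

age = np.array([[6],
                [12],
                [20],
                [36],
                [65]])

# Default: bins[i-1] <= x < bins[i], so 20 lands in bin 1
left_closed = np.digitize(age, bins=[20, 30, 64])

# right=True: bins[i-1] < x <= bins[i], so 20 lands in bin 0
right_closed = np.digitize(age, bins=[20, 30, 64], right=True)
```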

### Grouping Observations Using Clustering
You want to cluster observations so that similar observations are grouped together.

In [45]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Make simulated feature matrix
features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)

# Create DataFrame
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Make k-means clusterer
clusterer = KMeans(3, random_state=0)
# Fit clusterer
clusterer.fit(features)
# Predict values
dataframe["group"] = clusterer.predict(features)
# View first few observations
dataframe.head(5)

|   | feature_1 | feature_2 | group |
|---|---|---|---|
| 0 | -9.877554 | -3.336145 | 0 |
| 1 | -7.287210 | -8.353986 | 2 |
| 2 | -6.943061 | -7.023744 | 2 |
| 3 | -7.440167 | -8.791959 | 2 |
| 4 | -6.641388 | -8.075888 | 2 |


### Deleting Observations with Missing Values

In [47]:
# Create feature matrix
features = np.array([[1.1, 11.1],
 [2.2, 22.2],
 [3.3, 33.3],
 [4.4, 44.4],
 [np.nan, 55]])
# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

In [48]:
#Or
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()

|   | feature_1 | feature_2 |
|---|---|---|
| 0 | 1.1 | 11.1 |
| 1 | 2.2 | 22.2 |
| 2 | 3.3 | 33.3 |
| 3 | 4.4 | 44.4 |
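
Before deleting anything, it can help to see how much data is actually missing. The sketch below is one possible check using pandas; the variable names are illustrative and the data is the same feature matrix as above.

```python
import numpy as np
import pandas as pd

features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Number of missing values in each column
missing_per_column = dataframe.isna().sum()

# Number of observations with at least one missing value
rows_with_missing = int(dataframe.isna().any(axis=1).sum())

# Drop those observations
cleaned = dataframe.dropna()
```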
