<a href="https://colab.research.google.com/github/Bluelord/ML_Cookbook/blob/main/Handling_Numerical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Handling Numerical Data**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML Cookbook/Datasets/Boston-housing.csv", index_col=0)

X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data.iloc[:,-1])
data.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [10]:
# Rescaling features

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer

# MinMaxScaler
minmax_scale = MinMaxScaler(feature_range=(0,1))
scaled_features = minmax_scale.fit_transform(X_train)
print("MinMaxScaled: \n{}\n" .format(scaled_features))

# StandardScaler
scaler = StandardScaler()
standardized = scaler.fit_transform(X_train)
print("Standardized: \n {}\n:" .format(standardized))

# RobustScaler
robust_scaler = RobustScaler()
scaled = robust_scaler.fit_transform(X_train)
print("Robust scaled: \n {}\n:" .format(scaled))

# Normalization L2 & L1
L2_normalizer = Normalizer(norm='l2')
L2_normalized = L2_normalizer.transform(X_train)
print("L2_Normalized: \n {}\n:" .format(L2_normalized))

L1_normalizer = Normalizer(norm='l1')
L1_normalized = L1_normalizer.transform(X_train)
print("L1_Normalized: \n {}\n:" .format(L1_normalized))

MinMaxScaled: 
[[3.87569366e-03 0.00000000e+00 2.53665689e-01 ... 7.44680851e-01
  1.00000000e+00 1.69361702e-01]
 [1.26389965e-02 0.00000000e+00 2.81524927e-01 ... 8.93617021e-01
  9.07383126e-01 5.92056738e-01]
 [3.31797683e-04 8.42105263e-01 1.06671554e-01 ... 3.72340426e-01
  1.00000000e+00 8.34042553e-02]
 ...
 [1.50837564e-03 0.00000000e+00 3.71334311e-01 ... 6.38297872e-01
  9.72035907e-01 2.17021277e-01]
 [7.63342381e-01 0.00000000e+00 6.46627566e-01 ... 8.08510638e-01
  9.69917797e-01 6.02836879e-01]
 [3.14151261e-04 0.00000000e+00 1.73387097e-01 ... 8.08510638e-01
  1.00000000e+00 2.28936170e-01]]

Standardized: 
 [[-0.3870016  -0.48047944 -0.57784798 ...  0.508366    0.44298387
  -0.72236702]
 [-0.29863303 -0.48047944 -0.46733151 ...  1.15662122  0.04530852
   1.34405319]
 [-0.42273801  3.09671799 -1.16096779 ... -1.11227206  0.44298387
  -1.14258536]
 ...
 [-0.41087348 -0.48047944 -0.1110613  ...  0.04532655  0.32291255
  -0.48937468]
 [ 7.27141052 -0.48047944  0.98101594 .

**min-max scaling**

Rescaling is a common preprocessing task in machine learning. Many algorithms assume all features are on the same scale, typically 0 to 1 or –1 to 1.

There are a number of rescaling techniques, but one of the simplest is min-max scaling, which computes:

$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x_i'$ is the rescaled element.
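
As a quick check of the formula, the sketch below applies it by hand with NumPy to a small made-up feature (the values in `x` are illustrative and not taken from the housing data). The smallest value should map to 0 and the largest to 1.

```python
import numpy as np

# Hypothetical single feature; the values are illustrative only
x = np.array([-500.5, -100.1, 0.0, 100.1, 900.9])

# x_i' = (x_i - min(x)) / (max(x) - min(x))
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

Applied column by column, this is the same computation `MinMaxScaler(feature_range=(0,1))` performs on the training data above.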

**Standardizing Features**

Standardization rescales a feature so that it has a mean of $0$ and a standard deviation of $1$. It does not change the shape of the distribution, so the result is standard normal only if the feature was normally distributed to begin with. The standardized value is also called a z-score in statistics.

$x_i' = \frac{x_i - \bar{x}}{\sigma}$

where $\bar{x}$ is the mean of feature $x$ and $\sigma$ is its standard deviation.

Standardization is a common go-to scaling method for machine learning preprocessing. However, the best choice depends on the learning algorithm. For example, PCA often works better using standardization, while min-max scaling is often recommended for neural networks.

If our data has significant outliers, it can negatively impact our standardization by affecting the feature’s mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range, by using the RobustScaler class.
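
To see why outliers matter, here is a small sketch on a made-up feature with one extreme value. It computes the z-score by hand and, for comparison, a median/IQR rescaling of the kind `RobustScaler` performs with its default settings (assumed here: centering on the median and dividing by the 25th to 75th percentile range).

```python
import numpy as np

# Hypothetical feature with a single extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# z-score: the outlier inflates both the mean and the standard deviation,
# so the four ordinary values end up squeezed close together
z = (x - x.mean()) / x.std()  # population standard deviation (ddof=0)

# Median/IQR rescaling: the outlier barely influences the median or quartiles
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print("z-scores:     ", z)
print("robust scaled:", robust)
```

After the robust rescaling the ordinary values keep a usable spread, while the outlier remains clearly extreme.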

**Normalizing Observations**

We can also rescale across individual observations. Normalizer rescales the values of each observation (row) so that it has unit norm, i.e. a total length of 1. This type of rescaling is often used when we have many equivalent features.

The Euclidean norm (often called **L2**) is the default: $||x||_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

Manhattan norm (**L1**): $||x||_1 = \sum_{i=1}^{n}|x_i|$
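
The two norms can be checked by hand. The sketch below normalizes the rows of a small made-up matrix with NumPy; `X` is illustrative only.

```python
import numpy as np

# Two hypothetical observations (rows) with two features each
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# L2: divide each row by its Euclidean length
X_l2 = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)

# L1: divide each row by the sum of its absolute values
X_l1 = X / np.abs(X).sum(axis=1, keepdims=True)

print("L2 normalized:\n", X_l2)
print("L1 normalized:\n", X_l1)
```

With L2, the first row becomes `[0.6, 0.8]`, whose Euclidean length is 1. With L1, the values of each row sum to 1.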

In [11]:
# Generating Polynomial & interaction features
from sklearn.preprocessing import PolynomialFeatures

polynomial_interaction = PolynomialFeatures(degree= 2, include_bias=False)
polynoamial_features = polynomial_interaction.fit_transform(X_train)
print("Polynomial features: \n {}\n:" .format(polynoamial_features))

# We can restrict the features created to only interaction features by setting interaction_only to True:
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
new_poly_features = interaction.fit_transform(X_train)
print("Polynomial features: \n {}\n:" .format(new_poly_features))

Polynomial features: 
 [[3.51140000e-01 0.00000000e+00 7.38000000e+00 ... 1.57529610e+05
  3.05613000e+03 5.92900000e+01]
 [1.13081000e+00 0.00000000e+00 8.14000000e+00 ... 1.29722429e+05
  8.13984200e+03 5.10760000e+02]
 [3.58400000e-02 8.00000000e+01 3.37000000e+00 ... 1.57529610e+05
  1.85352300e+03 2.18089000e+01]
 ...
 [1.40520000e-01 0.00000000e+00 1.05900000e+01 ... 1.48849356e+05
  3.61889780e+03 8.79844000e+01]
 [6.79208000e+01 0.00000000e+00 1.81000000e+01 ... 1.48201901e+05
  8.84661060e+03 5.28080400e+02]
 [3.42700000e-02 0.00000000e+00 5.19000000e+00 ... 1.57529610e+05
  3.88962000e+03 9.60400000e+01]]
:
Polynomial features: 
 [[3.5114000e-01 0.0000000e+00 7.3800000e+00 ... 7.7792400e+03
  1.5092000e+02 3.0561300e+03]
 [1.1308100e+00 0.0000000e+00 8.1400000e+00 ... 7.5635700e+03
  4.7460000e+02 8.1398420e+03]
 [3.5840000e-02 8.0000000e+01 3.3700000e+00 ... 6.3900900e+03
  7.5187000e+01 1.8535230e+03]
 ...
 [1.4052000e-01 0.0000000e+00 1.0590000e+01 ... 7.1760660e+03
  1.74

Polynomial features are often created when we suspect a nonlinear relationship between the features and the target. Interaction features capture cases where the effect of one feature on the target depends on the value of another feature.
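
To make the generated columns concrete, the sketch below builds degree-2 features by hand for a tiny made-up matrix. The column order (original features, then all pairwise products) is assumed to mirror what `PolynomialFeatures(degree=2, include_bias=False)` produces. This is an illustration and not the library's implementation.

```python
from itertools import combinations, combinations_with_replacement

import numpy as np

# Two hypothetical observations with features x1 and x2
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
n_features = X.shape[1]

# Full degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2
pairs_all = list(combinations_with_replacement(range(n_features), 2))
poly = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs_all])

# Interaction-only expansion: x1, x2, x1*x2 (no squared terms)
pairs_inter = list(combinations(range(n_features), 2))
inter = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs_inter])

print("Polynomial:\n", poly)
print("Interaction only:\n", inter)
```

For the first row `[2, 3]` the full expansion is `[2, 3, 4, 6, 9]` and the interaction-only expansion is `[2, 3, 6]`.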

In [12]:
# Transforming Features 

from sklearn.preprocessing import FunctionTransformer
def add_ten(x):
  return x+10

trans = FunctionTransformer(add_ten)
transformed = trans.transform(X_train.iloc[:,2:4])
print("Transformed Features: \n{}\n" .format(transformed))

# The same transformation can be done in pandas using apply()
print("Transformation using apply function: \n{}\n" .format(X_train.iloc[:,2:4].apply(add_ten)))

Transformed Features: 
     indus  chas
323  17.38    10
31   18.14    10
66   13.37    10
102  18.56    10
135  31.89    10
..     ...   ...
215  20.59    10
291  14.95    10
214  20.59    10
406  28.10    10
337  15.19    10

[379 rows x 2 columns]

Transformation using apply function: 
     indus  chas
323  17.38    10
31   18.14    10
66   13.37    10
102  18.56    10
135  31.89    10
..     ...   ...
215  20.59    10
291  14.95    10
214  20.59    10
406  28.10    10
337  15.19    10

[379 rows x 2 columns]



**Detecting outliers**

A common method is to assume the data is normally distributed and, based on that assumption, “draw” an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled 1) and any observation outside it as an outlier (labeled -1).

In [13]:
# Detecting outliers

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
feature, _ = make_blobs(n_samples = 10, n_features = 2, centers = 1, random_state = 70)

# Replace the first observation's values with extreme values
feature[0,0] = 10000
feature[0,1] = 10000

outliers_detection = EllipticEnvelope(contamination = 0.1)
outliers_detection.fit(feature)
outliers_detection.predict(feature)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we don’t know. Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the interquartile range (IQR), the difference between the third and first quartiles. Outliers are commonly defined as values more than 1.5 IQRs below the first quartile or above the third quartile:

In [17]:
feature = X_train

# Create a function to return the (row, column) indices of outliers,
# applying the 1.5 * IQR rule to each feature (column) separately
def indicies_of_outliers(x):
  x = np.asarray(x)
  # axis=0 computes the quartiles per column instead of pooling all values
  q1, q3 = np.percentile(x, [25, 75], axis=0)
  iqr = q3 - q1
  lower_bound = q1 - (iqr * 1.5)
  upper_bound = q3 + (iqr * 1.5)
  return np.where((x > upper_bound) | (x < lower_bound))

indicies_of_outliers(feature)

Handling the outliers:

Typically we have three strategies we can use to handle outliers:

*   We can drop them.
*   We can mark them as outliers and include that marker as a feature.
*   We can transform the feature to dampen the effect of the outlier.

How we handle them should depend on two considerations.
> First, we should consider what makes them outliers.
If we believe they are errors in the data such as from a broken sensor or a miscoded value, then we might drop the observation or replace outlier values with NaN. If we believe the outliers are genuine extreme values, then marking them as outliers or transforming their values is more appropriate.

> Second, how we handle outliers should be based on our goal for machine learning.
If we want to predict house prices based on features of the house, we might reasonably assume the price for mansions with over 100 bathrooms is driven by a different dynamic than regular family homes. Furthermore, if we are training a model to use as part of an online home loan web application, we might assume that our potential users will not include billionaires looking to buy a mansion.

In [18]:
# Handling Outliers
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

#  (1) Filter observations
print("Filter value: \n{}\n" 
      .format(houses[houses['Bathrooms'] < 20]))
# (2) Features based on boolean condition
print("Features based on boolean condition: \n{}\n" 
      .format(np.where(houses["Bathrooms"] < 20, 0, 1)))

# (3) Log feature
print("Transformed features: \n{}\n" 
      .format([np.log(x) for x in houses["Square_Feet"]]))

Filter value: 
    Price  Bathrooms  Square_Feet
0  534433        2.0         1500
1  392333        3.5         2500
2  293222        2.0         1500

Features based on boolean condition: 
[0 0 0 1]

Transformed features: 
[7.313220387090301, 7.824046010856292, 7.313220387090301, 10.778956289890028]



**Discretizing Features**

Depending on how we want to break up the data, there are two techniques:
*   We can binarize the feature according to some threshold.
*   We can break up numerical features according to multiple thresholds.

Discretization can be a fruitful strategy when we have reason to believe that a feature should behave more like a categorical feature.

In [19]:
# Discretizing Features

from sklearn.preprocessing import Binarizer

age = np.array([[6],
                [8],
                [15],
                [30],
                [50]])
# Binarize the feature: values above the threshold become 1, the rest 0
# (threshold is passed by keyword, as recent scikit-learn versions require)
binarizer = Binarizer(threshold=16)
print("Binarized features:\n{}\n" 
      .format(binarizer.fit_transform(age)))

# Break the feature up according to multiple thresholds
print("Breaking with multiple thresholds: \n{}\n" 
      .format(np.digitize(age, bins=[10,20,40])))


Binarized features:
[[0]
 [0]
 [0]
 [1]
 [1]]

Breaking with multiple thresholds: 
[[0]
 [0]
 [1]
 [2]
 [3]]
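
One detail worth knowing is how `np.digitize` treats values that fall exactly on a bin edge. The sketch below uses a made-up `age` array that includes edge values (10 and 20) and compares the default behaviour with `right=True`.

```python
import numpy as np

# Hypothetical ages, including values that sit exactly on the bin edges
age = np.array([6, 10, 16, 20, 50])
bins = [10, 20, 40]

# Default (right=False): a value equal to an edge goes into the upper bin,
# i.e. bins[i-1] <= x < bins[i]
left_closed = np.digitize(age, bins=bins)

# right=True: a value equal to an edge stays in the lower bin,
# i.e. bins[i-1] < x <= bins[i]
right_closed = np.digitize(age, bins=bins, right=True)

# A single edge with right=True gives a strictly-greater-than threshold
binary = np.digitize(age, bins=[16], right=True)

print(left_closed)   # [0 1 1 2 3]
print(right_closed)  # [0 0 1 1 3]
print(binary)        # [0 0 0 1 1]
```

The single-edge case with `right=True` marks only values strictly greater than 16 with 1, which is the same rule a threshold of 16 applies in `Binarizer`.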



In [21]:
# Grouping Observations Using Clustering

# We can use clustering as a preprocessing step: the predicted group becomes a new feature.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

features, _ = make_blobs(n_samples = 50, n_features = 2, centers = 3, random_state = 70)
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Make k-means clusterer
clusterer = KMeans(3, random_state=0)
clusterer.fit(features)

# Predict values
dataframe["group"] = clusterer.predict(features)
# View first few observations
dataframe.head(5)

# The details are covered with the k-means clustering algorithm

Unnamed: 0,feature_1,feature_2,group
0,11.639411,6.210485,1
1,5.514475,7.033964,1
2,-3.92997,-7.583885,0
3,8.346673,8.549041,1
4,-2.775634,-6.824632,0


In [24]:
# Deleting Observations with Missing Values

dataframe = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML Cookbook/Datasets/titanic.csv")

# Drop observations that contain any missing value
print("Removed observations with missing values: \n {}\n" 
      .format(dataframe.dropna()))

Removed observations with missing values: 
      PassengerId  Survived  Pclass  ...     Fare        Cabin  Embarked
1              2         1       1  ...  71.2833          C85         C
3              4         1       1  ...  53.1000         C123         S
6              7         0       1  ...  51.8625          E46         S
10            11         1       3  ...  16.7000           G6         S
11            12         1       1  ...  26.5500         C103         S
..           ...       ...     ...  ...      ...          ...       ...
871          872         1       1  ...  52.5542          D35         S
872          873         0       1  ...   5.0000  B51 B53 B55         S
879          880         1       1  ...  83.1583          C50         C
887          888         1       1  ...  30.0000          B42         S
889          890         1       1  ...  30.0000         C148         C

[183 rows x 12 columns]



Out of 891 observations, only 183 remain, so we need a way to handle these NaN values without dropping observations.

Depending on the cause of the missing values, deleting observations can introduce bias into our data. There are three types of missing data:
*   Missing Completely At Random (MCAR): The probability that a value is missing is independent of everything.
*   Missing At Random (MAR): The probability that a value is missing is not completely random, but depends on the information captured in other features.
*   Missing Not At Random (MNAR): The probability that a value is missing is not random and depends on information not captured in our features.

It is sometimes acceptable to delete observations if they are MCAR or MAR. However, if the value is MNAR, the fact that a value is missing is itself information. Deleting MNAR observations can inject bias into our data because we are removing observations produced by some effect.
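
One way to keep the information carried by missingness, rather than deleting it, is to record it as an indicator column before dropping or imputing anything. The sketch below uses a small made-up DataFrame. The column names `Age`, `Fare`, and `Age_missing` are illustrative and are not read from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing ages
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep the fact that a value was missing as its own 0/1 feature
df["Age_missing"] = df["Age"].isna().astype(int)
print(df)
```

Checking the per-column counts first also shows which feature is responsible for most of the dropped rows.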



In [31]:
# Imputing Missing Values

# Load libraries
from fancyimpute import KNN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000, n_features = 2, random_state = 1)

# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# Replace the first feature's first value with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan

# Predict the missing values in the feature matrix
# (pass the standardized matrix that contains the NaN, not the raw features)
features_knn_imputed = KNN(k=5, verbose=0).fit_transform(standardized_features)

# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_knn_imputed[0,0])




In [38]:
# Load library

from sklearn.impute import SimpleImputer
# Create imputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Impute values (fit on the standardized matrix that contains the NaN)
features_mean_imputed = imp_mean.fit_transform(standardized_features)
# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_mean_imputed[0,0])


There are two main strategies for replacing missing data with substitute values, each of which has strengths and weaknesses.
*   We can use machine learning to predict the values of the missing data. To do this, we treat the feature with missing values as a target vector and use the remaining features to predict the missing values.
*   An alternative and more scalable strategy is to fill in all missing values with some average value, such as the mean.
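
The two strategies can be sketched side by side with plain NumPy. This is a minimal illustration on simulated data, not the method used in the cells above. The data-generating process (`x1` roughly equal to `2 * x2`) and the choice of a simple linear fit as the predictive model are both assumptions made for the example.

```python
import numpy as np

# Simulated data: feature x1 depends linearly on feature x2 (assumption)
rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
x1 = 2.0 * x2 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Hide one value so we can compare against it later
true_value = X[0, 0]
X[0, 0] = np.nan
missing = np.isnan(X[:, 0])

# Strategy 1: treat the incomplete feature as a target and predict it
# from the other feature, fitting only on the complete rows
slope, intercept = np.polyfit(X[~missing, 1], X[~missing, 0], deg=1)
model_fill = slope * X[missing, 1] + intercept

# Strategy 2: fill with the mean of the observed values
mean_fill = np.nanmean(X[:, 0])

print("True value:     ", true_value)
print("Model-based fill:", model_fill[0])
print("Mean fill:       ", mean_fill)
```

The model-based fill uses information from the other feature, at the cost of fitting a model. The mean fill ignores the other features but is cheap to compute on large datasets.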
