<a href="https://colab.research.google.com/github/Bluelord/ML_Cookbook/blob/main/Handling_Numerical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Handling Numerical Data**

In [None]:
import numpy as np
import pandas as pd


data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datafiles/Boston-housing.csv')
#dataframe = data.drop(data.iloc[:,0], axis=1)

features = data.iloc[:,1:10]

In [None]:
# Rescaling Features
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer

features = data.iloc[:,1:10] # Columns 1 through 9 of the dataset (nine features)

# MinMaxScaler
minmax_scale = MinMaxScaler(feature_range=(0,1))
scaled_features = minmax_scale.fit_transform(features)
print("MinMaxScaled: \n{}\n" .format(scaled_features))

# StandardScaler
scaler = StandardScaler()
standardized = scaler.fit_transform(features)
print("Standardized: \n {}\n:" .format(standardized))

# If our data has significant outliers, they can negatively impact our standardization by
# affecting the feature's mean and variance. In this scenario, it is often helpful to instead
# rescale the feature using the median and interquartile range.

# RobustScaler
robust_scaler = RobustScaler()
scaled = robust_scaler.fit_transform(features)
print("RobustScaled: \n {}\n:" .format(scaled))

# Normalization with the L2 and L1 norms
# Normalizer rescales each observation (row) to unit norm, not each feature (column)
L2_normalizer = Normalizer(norm='l2')
L2_normalized = L2_normalizer.fit_transform(features)
print("L2_Normalized: \n {}\n:" .format(L2_normalized))

L1_normalizer = Normalizer(norm='l1')
L1_normalized = L1_normalizer.fit_transform(features)
print("L1_Normalized: \n {}\n:" .format(L1_normalized))
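
To make the three scalers above less of a black box, here is a minimal sketch that reproduces their formulas by hand with NumPy. The toy column `x` is a made-up assumption (it is not taken from the housing data), chosen so that one large value shows how each method reacts to an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical toy column with one large value (not from the housing data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min)
minmax_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: (x - mean) / std, with the population std (ddof=0)
standard_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Robust scaling: (x - median) / (Q3 - Q1)
q1, q3 = np.percentile(x, [25, 75], axis=0)
robust_manual = (x - np.median(x, axis=0)) / (q3 - q1)

# Compare the hand-computed values with scikit-learn's results
print(np.allclose(minmax_manual, MinMaxScaler().fit_transform(x)))
print(np.allclose(standard_manual, StandardScaler().fit_transform(x)))
print(np.allclose(robust_manual, RobustScaler().fit_transform(x)))
```

With the robust formula the four ordinary values stay within one unit of zero, whereas min-max scaling squeezes them all close to 0 because the single large value defines the maximum.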

In [None]:
# Generating Polynomial & interaction features
from sklearn.preprocessing import PolynomialFeatures

polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)
polynomial_features = polynomial_interaction.fit_transform(features)
print("Polynomial features: \n {}\n:" .format(polynomial_features))

# We can restrict the features created to only interaction features by setting interaction_only to True:
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
new_poly_features = interaction.fit_transform(features)
print("Interaction features: \n {}\n:" .format(new_poly_features))

In [None]:
# Transforming Features 

from sklearn.preprocessing import FunctionTransformer
def add_ten(x):
  return x+10

# FunctionTransformer wraps a plain function so it can be used like any other transformer
transformer = FunctionTransformer(add_ten)
transformed = transformer.fit_transform(features)

print("Transformed Features: \n{}\n" .format(transformed))

# The same transformation can be done in pandas using apply()
applied = features.apply(add_ten)

print("Transformation using apply function: \n{}\n" .format(applied))
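
Adding ten is only a toy transformation. A more typical use, sketched below on made-up values, is a log transform to compress a skewed feature. Pairing `np.log1p` with `np.expm1` as the inverse is an illustrative choice, not something the cell above prescribes.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical skewed, non-negative values
X = np.array([[0.0, 9.0],
              [99.0, 999.0]])

# log1p computes log(1 + x), so zeros are handled; expm1 undoes it
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

print(X_log)
print(np.allclose(X_back, X))
```

Supplying `inverse_func` lets the values be mapped back to the original scale later, for example when reporting results.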

In [None]:
# Detecting outliers

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
feature, _ = make_blobs(n_samples = 10,
                         n_features = 2,
                         centers = 1,
                         random_state = 1)
# Replace the first observation's values with extreme values
feature[0,0] = 10000
feature[0,1] = 10000

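# contamination is the assumed proportion of outliers in the data
outliers_detection = EllipticEnvelope(contamination = 0.1)
outliers_detection.fit(feature)
# predict returns -1 for outliers and 1 for inliers
outliers_detection.predict(feature)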
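
Detection alone leaves the extreme observation in the data. The following sketch, which regenerates the same kind of toy blob data, turns the detector's labels into a boolean mask and keeps only the inliers. Fixing `random_state` on the detector is an added assumption for repeatability.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Toy data: one cluster of 10 points, with the first point made extreme
points, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
points[0, :] = 10000

# fit_predict returns -1 for outliers and 1 for inliers
detector = EllipticEnvelope(contamination=0.1, random_state=0)
labels = detector.fit_predict(points)

# Keep only the observations labeled as inliers
inlier_mask = labels == 1
cleaned = points[inlier_mask]

print(labels)
print(cleaned.shape)
```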



A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we usually don't know.

Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the interquartile range (IQR):

In [None]:
feature = features.iloc[:,0]

# Create a function to return the indices of outliers
def indices_of_outliers(x):
  # Values more than 1.5 * IQR below Q1 or above Q3 count as outliers
  q1, q3 = np.percentile(x, [25, 75])
  iqr = q3 - q1
  lower_bound = q1 - (iqr * 1.5)
  upper_bound = q3 + (iqr * 1.5)
  return np.where((x > upper_bound) | (x < lower_bound))

indices_of_outliers(feature)
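
Once the IQR bounds are known, they can be used either to drop the extreme values or to cap them at the bounds. The sketch below shows both on a small made-up array. Capping (sometimes called winsorizing) is offered as one possible follow-up, not as the recommended choice.

```python
import numpy as np

# Hypothetical feature values with one obvious extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# IQR bounds: 1.5 * IQR below Q1 and above Q3
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask that is True for outliers
is_outlier = (values < lower) | (values > upper)

# Option 1: drop the outliers
filtered = values[~is_outlier]

# Option 2: cap the outliers at the bounds instead of dropping them
clipped = np.clip(values, lower, upper)

print(lower, upper)
print(filtered)
print(clipped)
```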


Handling the outliers:

Typically we have three strategies we can use to handle outliers:

*   We can drop them.
*   We can mark them as outliers and include that marker as a feature.
*   We can transform the feature to dampen the effect of the outlier.

Which strategy we choose should depend on two considerations.

*   First, we should consider what makes them outliers.

If we believe they are errors in the data, such as from a broken sensor or a miscoded value, then we might drop the observation or replace the outlier values with NaN.

If we believe the outliers are genuine extreme values, then marking them as outliers or transforming their values is more appropriate.

*   Second, how we handle outliers should be based on our goal for machine learning.

If we want to predict house prices based on features of the house, we might reasonably assume the price for mansions with over 100 bathrooms is driven by a different dynamic than regular family homes.

If we are training a model to use as part of an online home loan web application, we might assume that our potential users will not include billionaires looking to buy a mansion.



In [None]:
# Handling Outliers
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

#  (1) Filter observations
print("Filter value: \n{}\n" 
      .format(houses[houses['Bathrooms'] < 20]))
# (2) Features based on boolean condition
print("Features based on boolean condition: \n{}\n" 
      .format(np.where(houses["Bathrooms"] < 20, 0, 1)))

# (3) Log feature: the log dampens the effect of the extreme value
print("Transformed features: \n{}\n" 
      .format(np.log(houses["Square_Feet"]).tolist()))

In [None]:
# Discretizing Features

from sklearn.preprocessing import Binarizer

age = np.array([[6],
                [8],
                [15],
                [30],
                [50]])
# Binarize the feature with a single threshold: values above 16 become 1, the rest 0
binarizer = Binarizer(threshold=16)
print("Binarized features:\n{}\n" 
      .format(binarizer.fit_transform(age)))

# Break the feature into bins using multiple thresholds
print("Binned with multiple thresholds: \n{}\n" 
      .format(np.digitize(age, bins=[10,20,40])))
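
One detail that is easy to miss with `np.digitize` is which side of each bin is closed. The sketch below uses different bin edges from the cell above (an assumption made so that the value 30 falls exactly on an edge) and also shows `pd.cut` as a way to get labeled bins. The label names are hypothetical.

```python
import numpy as np
import pandas as pd

age = np.array([6, 8, 15, 30, 50])

# Default: bins are closed on the left, so 30 falls into the bin starting at 30
left_closed = np.digitize(age, bins=[20, 30, 64])

# right=True: bins are closed on the right, so 30 falls into the bin ending at 30
right_closed = np.digitize(age, bins=[20, 30, 64], right=True)

# pd.cut uses right-closed intervals by default and can attach labels
labeled = pd.cut(age, bins=[0, 20, 30, 64],
                 labels=["up_to_20", "21_to_30", "31_to_64"])

print(left_closed)
print(right_closed)
print(list(labeled))
```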


In [None]:
# Grouping Observations Using Clustering
# We can use clustering as a preprocessing step: each observation's cluster becomes a new feature.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)


dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Make k-means clusterer
clusterer = KMeans(n_clusters=3, random_state=0)
clusterer.fit(features)

# Predict values
dataframe["group"] = clusterer.predict(features)
# View first few observations
dataframe.head(5)
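
Besides the cluster label, the distance from each observation to every cluster center can also serve as a feature. This is a sketch of that variation, under the assumption that such distance features are useful for the downstream model. The column names `dist_to_cluster_k` are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same kind of toy data as above: 50 points around 3 centers
X, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# n_init is set explicitly so the result does not depend on the library default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform returns the distance from each observation to each cluster center
distances = kmeans.transform(X)

augmented = pd.DataFrame(X, columns=["feature_1", "feature_2"])
for k in range(3):
    augmented[f"dist_to_cluster_{k}"] = distances[:, k]
augmented["group"] = kmeans.predict(X)

print(augmented.head(5))
```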

In [None]:
# Deleting Observations with Missing Values

features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])

# Keep only observations that are not (denoted by ~) missing
print("Values which are not NaN: \n {}\n: " 
      .format(features[~np.isnan(features).any(axis=1)]))

# Dropping observations with missing values using pandas
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
print("Removed observations with missing values: \n {}\n" 
      .format(dataframe.dropna()))

Depending on the cause of the missing values, deleting observations can introduce bias into our data. There are three types of missing data:
*   Missing Completely At Random (MCAR): The probability that a value is missing is independent of everything.
*   Missing At Random (MAR): The probability that a value is missing is not completely random, but depends on the information captured in other features.
*   Missing Not At Random (MNAR): The probability that a value is missing is not random and depends on information not captured in our features.

It is sometimes acceptable to delete observations if they are MCAR or MAR. However, if the value is MNAR, the fact that a value is missing is itself information. Deleting MNAR observations can inject bias into our data because we are removing observations produced by some unobserved systematic effect.
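
The effect of the missingness mechanism on deletion can be illustrated with a small simulation. Everything in this sketch is synthetic: the feature is drawn from a standard normal distribution, and the missingness rates (30% at random, 80% for values above 0.5) are arbitrary assumptions chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature with a true mean of about 0
values = rng.normal(loc=0.0, scale=1.0, size=10_000)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(values.size) < 0.3

# MNAR: large values (above 0.5) are missing 80% of the time
mnar_missing = (values > 0.5) & (rng.random(values.size) < 0.8)

# Mean of the feature after deleting the observations with missing values
full_mean = values.mean()
mcar_mean = values[~mcar_missing].mean()
mnar_mean = values[~mnar_missing].mean()

print("Full data mean:      ", round(full_mean, 3))
print("After MCAR deletion: ", round(mcar_mean, 3))
print("After MNAR deletion: ", round(mnar_mean, 3))
```

Under MCAR the mean of the remaining observations stays close to the full-data mean, while under this MNAR mechanism it is pulled clearly downward, because the deleted observations are disproportionately the large ones.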



In [None]:
# Imputing Missing Values
# Use imputation when the data has missing values that we want to fill in or predict.

from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its feature (column)
mean_imputer = SimpleImputer(strategy="mean")

features_mean_imputed = mean_imputer.fit_transform(features)
# View the value imputed for the missing entry in the last row
print("Imputed Value:", features_mean_imputed[4,0])

There are two main strategies for replacing missing data with substitute values, each of which has strengths and weaknesses.

First, we can use machine learning to predict the values of the missing data. To do this, we treat the feature with missing values as a target vector and use the remaining subset of features to predict the missing values.

An alternative and more scalable strategy is to fill in all missing values with some average value.
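
The two strategies can be compared side by side on the small array used earlier. In this sketch, scikit-learn's `KNNImputer` stands in for the prediction-based strategy. That is one possible choice of model, and `n_neighbors=2` is an arbitrary assumption.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Same small array as in the deletion example: one missing value in the last row
X = np.array([[1.1, 11.1],
              [2.2, 22.2],
              [3.3, 33.3],
              [4.4, 44.4],
              [np.nan, 55.0]])

# Strategy 1 (prediction-based): average the two most similar observations,
# where similarity is measured on the features that are present
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Strategy 2 (average-based): fill with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

print("KNN-imputed value: ", knn_filled[4, 0])
print("Mean-imputed value:", mean_filled[4, 0])
```

Here the nearest-neighbor estimate uses the two rows whose second feature is closest to 55, so it lands higher than the column mean, which ignores the second feature entirely.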
