<a href="https://colab.research.google.com/github/Deepan-mn/Machine_Learning_Techniques/blob/main/Numerical_Data/Handling_Numerical__Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Handling Numerical Data**

##**Introduction**<br>
Numerical data is data expressed in numbers rather than in natural-language descriptions. It is also called quantitative data.<br>
Quantitative data is a measurement of something, such as class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 30 students, 1000 rupees in sales).

##**Rescaling a Feature**

Rescaling is a common preprocessing task in machine learning. Many algorithms assume that all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is **min-max scaling**, available in scikit-learn as MinMaxScaler. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max scaling calculates $x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$, where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x'_i$ is the **rescaled element**. Min-max scaling is often used with neural networks.
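
The min-max formula, (x_i - min(x)) / (max(x) - min(x)), can be checked by hand before turning to scikit-learn. The following is a minimal NumPy sketch using the same five values as the scaler example in this section; the variable names are illustrative only.

```python
import numpy as np

# Same five values as the scikit-learn example, kept as a flat array
feature = np.array([-500.5, -100.1, 0.0, 100.1, 900.9])

# Min-max formula: (x_i - min(x)) / (max(x) - min(x))
x_min, x_max = feature.min(), feature.max()
manual_scaled = (feature - x_min) / (x_max - x_min)

print(manual_scaled)
```

The smallest value maps to 0 and the largest to 1, with every other value placed proportionally in between.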

In [4]:
import pandas as pd
import numpy as np

In [None]:
#Use Scikit-learn's MinMaxScaler to rescale a feature array
from sklearn import preprocessing
# Create Feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])
#Create Scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))

#Scale feature
scaled_feature =minmax_scale.fit_transform(feature)


In [None]:
scaled_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

The output array shows that the feature has been successfully rescaled to between 0 and 1. scikit-learn's MinMaxScaler offers two ways to rescale a feature. One option is to use **fit to calculate the minimum and maximum** values of the feature, then use **transform to rescale the feature**. The second option is to use fit_transform to do both operations at once. There is no mathematical difference between the two options, but there is sometimes a practical benefit to keeping the operations separate because it allows us to apply the same transformation to different sets of data.
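
To illustrate why keeping fit and transform separate can be useful, here is a small sketch. The split into `train` and `new_data` is hypothetical and not part of the original example: the scaler learns its minimum and maximum from one set of values and then reuses them on values it has not seen.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split: values the scaler learns from, and values seen later
train = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])
new_data = np.array([[200.2], [1000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)                         # learn min and max from train only
train_scaled = scaler.transform(train)    # rescale the training values
new_scaled = scaler.transform(new_data)   # reuse the same min and max

print(new_scaled)
```

Note that a new value larger than the training maximum (here 1000.0) is mapped above 1, because the scaler keeps using the range it learned during fit.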

##**Standardizing a Feature**
Standardizing transforms a feature to have a mean of 0 and a standard deviation of 1. scikit-learn's **StandardScaler** performs both transformations.

In [2]:
# Load Libraries
import numpy as np
from sklearn import preprocessing
#Create Feature
x= np.array([[-1000.1],
        [-200.2],
        [500.5],
        [600.6],
        [9000.9]])
#Create Scaler
scaler = preprocessing.StandardScaler()
#Transform the feature
standardized = scaler.fit_transform(x)

In [None]:
standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

Standardization transforms the data so that it has a **mean, x̄, of 0** and a **standard deviation, σ, of 1**. The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (**also called a z-score in statistics**). Standardization is a common go-to scaling method for machine learning preprocessing and is used more often than min-max scaling. However, the best choice depends on the learning algorithm. For example, principal component analysis (**PCA**) often works better using standardization, while min-max scaling is often recommended for neural networks. As a general rule, use standardization unless you have a specific reason to use an alternative. Standardization is sensitive to outliers, a weakness that RobustScaler addresses. We can check the effect of standardization by looking at the mean and standard deviation of the transformed feature:

In [None]:
#Print mean and standard Deviation
print("Mean:",round(standardized.mean()))
# A mean of 0 and a standard deviation of 1 confirm that the standardization worked
print("Standard deviation:", standardized.std())

Mean: 0
Standard deviation: 1.0
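
The z-score can also be reproduced by hand. The sketch below assumes the same feature values as above and compares a manual calculation against StandardScaler, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# z-score: subtract the mean, divide by the population standard deviation
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Reference result from scikit-learn
from_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(manual, from_sklearn))
```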


If our data has significant outliers, they can negatively impact standardization by affecting the feature's **mean and variance**. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range.

In [3]:
#create Scaler
robust_scaler= preprocessing.RobustScaler()
#Transform feature
robust_scaler.fit_transform(x)

array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
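
As a rough check of what RobustScaler does with its default settings, the sketch below subtracts the median and divides by the interquartile range by hand, using the same feature values. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Median and interquartile range (75th minus 25th percentile) of the feature
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# Robust scaling by hand: center on the median, scale by the IQR
manual_robust = (x - median) / (q3 - q1)

print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))
```

Because the median and quartiles barely move when one value is extreme, the four ordinary values stay close to 0 while the outlier remains clearly separated.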

##**Normalizing Observation**

Normalizing observations is mostly used in text classification. Here we rescale the feature values of each observation to have **unit norm (a total length of 1)**.

In [7]:
#Use Normalizer with a norm argument
#Load libraries
from sklearn.preprocessing import Normalizer
#Create feature matrix
features = np.array([[0.5, 0.5], # each row will have a Euclidean length of 1 after normalizing
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]])
#Create Normalizer
normalizer = Normalizer(norm="l2") # "l2" is the Euclidean norm and is the default
#Transform feature matrix
normalizer.transform(features)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

Many rescaling methods (e.g., min-max scaling and standardization) operate on features; however, we can also rescale across individual observations. **Normalizer** rescales the values of individual observations to have unit norm (a length of 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification, where every word or n-word group is a feature). Normalizer provides three norm options, with the **Euclidean norm (often called L2)** being the default argument: $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$, where $x$ is an individual observation and $x_n$ is that observation's value for the $n$th feature.

In [8]:
#Transform feature matrix
features_l2_norm =Normalizer(norm="l2").transform(features)
#show feature matrix
features_l2_norm

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

Alternatively, we can specify the **Manhattan** norm **(L1)**:<br>
$\|x\|_1 = \sum_{i=1}^{n} |x_i|$<br>
That is, the sum of the absolute values of all $n$ features.

In [11]:
#Transform features matrix 
features_l1_norm = Normalizer(norm="l1").transform(features)
#show features matrix
features_l1_norm

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])

Intuitively, the L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called the **"Manhattan norm"** or **"Taxicab norm"**. Practically, notice that norm="l1" rescales an observation's values so they sum to 1, which can sometimes be a desirable quality:

In [13]:
print("Sum of the first observation\'s values:",
features_l1_norm[0,0]+ features_l1_norm[0,1])

Sum of the first observation's values: 1.0
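
The same kind of check can be run for every row and every norm option. The sketch below assumes the feature matrix from this section and verifies that each row has unit norm under the norm that was requested; `"max"` is the third option Normalizer accepts, and it divides each row by its largest absolute value.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

rows_l2 = Normalizer(norm="l2").fit_transform(features)
rows_l1 = Normalizer(norm="l1").fit_transform(features)
rows_max = Normalizer(norm="max").fit_transform(features)

# Each row should measure 1 under the norm that was used to rescale it
l2_lengths = np.linalg.norm(rows_l2, axis=1)   # square root of sum of squares
l1_lengths = np.abs(rows_l1).sum(axis=1)       # sum of absolute values
max_lengths = np.abs(rows_max).max(axis=1)     # largest absolute value

print(l2_lengths, l1_lengths, max_lengths)
```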


##**Generating Polynomial And Interaction Features**

In [19]:
#Load  Libraries 
import  numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [20]:
features = np.array([[2,3],
                      [2,3],
                      [2,3]])


In [21]:
#Create PolynomialFeatures Object
polynomial_interaction =PolynomialFeatures(degree=2, include_bias=False)
#create Polynomial features
polynomial_interaction.fit_transform(features)   # output columns: x1, x2, x1^2, x1*x2, x2^2

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

The degree parameter determines the maximum degree of the polynomial. For example, degree=2 will create new features raised to the second power: $x_1, x_2, x_1^2, x_2^2$, while degree=3 will create new features raised to the second and third power: $x_1, x_2, x_1^2, x_2^2, x_1^3, x_2^3$. Furthermore, by default **PolynomialFeatures** includes **interaction features**: $x_1 x_2$.<br>
We can restrict the features created to only interaction features by setting **interaction_only** to True.

In [22]:
interaction =PolynomialFeatures(degree=2,
interaction_only=True, include_bias=False)
interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])

##Explanation<br>
Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. For example, we might suspect that the effect of age on the probability of having a major medical condition is not constant over time but increases as age increases. We can encode that nonconstant effect in a feature, $x$, by generating that feature's higher-order forms ($x^2$, $x^3$, etc.). Additionally, we often run into situations where the effect of one feature is dependent on another feature. A simple example would be if we were trying to predict whether or not our coffee was sweet and we had two features: 1) whether or not the coffee was stirred and 2) whether or not we added sugar. Individually, each feature does not predict coffee sweetness, but the combination of their effects does. That is, a coffee would only be sweet if the coffee had sugar and was stirred. The effects of each feature on the target (sweetness) are dependent on each other. We can encode that relationship by including an interaction feature that is the product of the individual features.
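
The coffee example can be sketched with a tiny made-up dataset. The `coffee` array below is hypothetical: its two columns stand for "stirred" and "sugar", each coded as 0 or 1, and the interaction column is their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: columns are [stirred, sugar], each 0 (no) or 1 (yes)
coffee = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

# Keep the original columns and add only their product
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
coffee_with_interaction = interaction.fit_transform(coffee)

# Output columns: stirred, sugar, stirred * sugar
print(coffee_with_interaction)
```

Only the last row, where the coffee is both stirred and sugared, has an interaction value of 1, which matches the idea that sweetness depends on the two features together.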

##**Transforming Features**<br>
Here we apply a custom transformation function to one or more features.

In [1]:
#Load Libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Create feature matrix
features = np.array([[2, 3],
                    [2, 3],
                    [2, 3]])
#Define a simple function
def add_ten(x):
  return x+10

#create transformer
ten_transformer =FunctionTransformer(add_ten)
#Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using apply:

In [4]:
#Load Library
import pandas as pd

In [7]:
#Create DataFrame
df = pd.DataFrame(features, columns=["feature_1","feature_2"])
#Apply function
df.apply(add_ten)

Unnamed: 0,feature_1,feature_2
0,12,13
1,12,13
2,12,13


##**Detecting Outliers**

Outliers are values that deviate abnormally from the rest of the dataset. A common method for detecting them is to assume the data is normally distributed and, based on that assumption, "draw" an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1).

In [8]:
#Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
#Create simulated data
features,_ =make_blobs(n_samples =10,n_features=2, centers=1, random_state=1)
#Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000
#Create detector
outlier_detector = EllipticEnvelope(contamination=.1)
#Fit detector
outlier_detector.fit(features)
#Predict outliers
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

In [9]:
features # first element in the array is outlier which we added

array([[ 1.00000000e+04,  1.00000000e+04],
       [-2.76017908e+00,  5.55121358e+00],
       [-1.61734616e+00,  4.98930508e+00],
       [-5.25790464e-01,  3.30659860e+00],
       [ 8.52518583e-02,  3.64528297e+00],
       [-7.94152277e-01,  2.10495117e+00],
       [-1.34052081e+00,  4.15711949e+00],
       [-1.98197711e+00,  4.02243551e+00],
       [-2.18773166e+00,  3.33352125e+00],
       [-1.97451969e-01,  2.34634916e+00]])

A major limitation of this approach is the need to specify a contamination parameter, which is the proportion of observations that are outliers, a value that we do not know. Think of contamination as our estimate of the cleanliness of our data. If we expect our data to have few outliers, we can set contamination to something small. However, if we believe that the data is very likely to have outliers, we can set it to a higher value. Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the **interquartile range (IQR)**:

In [11]:
# Create a function to return the indices of outliers
# Note: np.percentile is computed over all values of x, so a 2D input is treated as one pooled set of values
def indicies_of_outliers(x):
  q1,q3 =np.percentile(x,[25,75])
  iqr =q3-q1
  lower_bound= q1-(iqr*1.5)
  upper_bound =q3+(iqr *1.5)
  return np.where((x>upper_bound)| (x<lower_bound))
#Run function
indicies_of_outliers(features)

(array([0, 0]), array([0, 1]))

IQR is the difference between the first and third quartile of a set of data. You can think of IQR as the spread of the bulk of the data, with outliers being observations far from the main concentration of data. Outliers are commonly defined as any value 1.5 IQRs less than the first quartile or 1.5 IQRs greater than the third quartile.
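
To apply the IQR rule to one feature at a time rather than to all values pooled together, a sketch like the following could be used. The function name `iqr_outlier_indices` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def iqr_outlier_indices(x):
    """Return indices of values more than 1.5 IQRs outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # np.where returns a tuple; take the first (and only) axis for 1D input
    return np.where((x > upper_bound) | (x < lower_bound))[0]

# Hypothetical single feature with one extreme value
feature = np.array([1, 2, 3, 4, 5, 100])
outlier_idx = iqr_outlier_indices(feature)

print(outlier_idx)
```

For a feature matrix, the same function could be called on each column in turn, for example on `features[:, 0]`.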

##**Handling Outliers**

Typically, we have three strategies for handling outliers.

First, we can drop them:

In [12]:
#Load Library
import pandas as pd
#Create DataFrame 
houses =pd.DataFrame()
houses['Price']=[534433,392333,293222,4322032]
houses['Square_Feet'] =[1500,2500,1500,48000]
houses['Bathrooms']=[2,3.5,2,116]
#Filter Observations
houses[houses['Bathrooms']<20]

Unnamed: 0,Price,Square_Feet,Bathrooms
0,534433,1500,2.0
1,392333,2500,3.5
2,293222,1500,2.0


Second, we can mark them as outliers and include that flag as a feature:

In [14]:
#Load Library
import numpy as np
#create feature based on boolean condition
houses['Outliers']=np.where(houses['Bathrooms']<20,0,1)
#show data
houses

Unnamed: 0,Price,Square_Feet,Bathrooms,Outliers
0,534433,1500,2.0,0
1,392333,2500,3.5,0
2,293222,1500,2.0,0
3,4322032,48000,116.0,1


Finally, we can transform the feature to **dampen the effect** of the outlier:

In [15]:
#Log feature
houses['Log_of_Square_Feet'] = np.log(houses['Square_Feet'])
#show data
houses


Unnamed: 0,Price,Square_Feet,Bathrooms,Outliers,Log_of_Square_Feet
0,534433,1500,2.0,0,7.31322
1,392333,2500,3.5,0,7.824046
2,293222,1500,2.0,0,7.31322
3,4322032,48000,116.0,1,10.778956


##**Discretizing Features**

Depending on how we want to break up the data, there are two techniques we can use. First, we can binarize the feature according to some threshold. This converts the numerical data into categorical data based on the threshold value.

In [25]:
# Load libraries
import numpy as np
from sklearn.preprocessing import Binarizer
# Create feature
age = np.array([[6],
                [12],
                [20],
                [36],
                [65]])
# Create binarizer
binarizer= Binarizer(threshold=18)
# Transform feature
binarizer.fit_transform(age)

array([[0],
       [0],
       [1],
       [1],
       [1]])

Second, we can break up numerical features according to **multiple thresholds**:

In [26]:
#Bin feature
np.digitize(age,bins=[20,30,64])

array([[0],
       [0],
       [1],
       [2],
       [3]])

In [28]:
#Bin feature
# right defaults to False, so a value equal to a bin edge goes into the higher bin;
# with right=True it goes into the lower bin, so age 20 moves from bin 1 to bin 0
np.digitize(age, bins=[20,30,64], right=True)

array([[0],
       [0],
       [0],
       [2],
       [3]])
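
The bin indices returned by np.digitize can be turned into readable categories. The sketch below assumes the same ages and thresholds as above; the label strings are hypothetical. Three thresholds always produce four bins, so four labels are needed.

```python
import numpy as np

age = np.array([6, 12, 20, 36, 65])
bins = [20, 30, 64]

# Hypothetical labels, one per bin (number of thresholds + 1)
labels = np.array(["under 20", "20 to 29", "30 to 63", "64 and over"])

# Default right=False: a value equal to a threshold falls in the higher bin
bin_index = np.digitize(age, bins=bins)
age_group = labels[bin_index]

print(age_group)
```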

##**Grouping Observations Using Clustering** (K-Means Clustering algorithm)

In [35]:
#Load libraries
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

#Make simulated feature matrix
features,_ = make_blobs(n_samples =50, n_features=2, centers=3,random_state=1)

#Create DataFrame
dataframe =pd.DataFrame(features, columns=['feature_1','feature_2'])

#Make K_Means cluster
clusterer = KMeans(3,random_state=0)

#Fit clusterer
clusterer.fit(features)

#Predict values
dataframe['group'] =clusterer.predict(features)

#Shuffle the rows, then view the first few observations
dataframe=dataframe.sample(frac=1)
dataframe.head(5)

Unnamed: 0,feature_1,feature_2,group
12,-6.749247,-10.175429,2
40,-6.904845,-7.277059,2
41,-1.617346,4.989305,1
49,-7.684883,-7.455196,2
45,-9.509194,-4.02892,0


##**Deleting the Observation with Missing Values**

Here we delete the observations that contain missing values.

In [1]:
# Load library
import numpy as np
# Create feature matrix
features = np.array([[1.1, 11.1],
                    [2.2, 22.2],
                    [3.3, 33.3],
                    [4.4, 44.4],
                    [np.nan, 55]])
# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])


Alternatively, we can drop observations with missing values using pandas:

In [2]:
# Load library
import pandas as pd
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()

Unnamed: 0,feature_1,feature_2
0,1.1,11.1
1,2.2,22.2
2,3.3,33.3
3,4.4,44.4
