#### Scikit Learn
#### Date 1/18/2024


#### 1. Preprocessing Data
###### -Many learning algorithms, such as linear models, benefit from standardization of the data set.
###### -They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
###### -If outliers are present in the set, robust scalers or other transformers can be more appropriate.
#### 1.1 StandardScaler

In [24]:
from sklearn import preprocessing
import numpy as np 

X_train=np.array([[1,-2,2],
                 [2,0,0],
                 [0,1,-1]])

###### -The fit(data) method computes the mean and standard deviation of each feature so that they can be used later for scaling.
###### -The transform(data) method scales the data using the mean and standard deviation computed by fit().
###### -The fit_transform(data) method does both fit and transform in one call.
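The three methods described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: `X_new` is a hypothetical batch of unseen data, introduced only to show that `transform` reuses the statistics learned by `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, -2, 2],
                    [2, 0, 0],
                    [0, 1, -1]], dtype=float)

# Hypothetical unseen data (an assumption for illustration only).
X_new = np.array([[1, 1, 1]], dtype=float)

scaler = StandardScaler()
scaler.fit(X_train)                         # learns the per-column mean_ and scale_
X_train_scaled = scaler.transform(X_train)  # applies the learned statistics
X_new_scaled = scaler.transform(X_new)      # reuses them; nothing is re-fitted

# fit_transform is equivalent to fit followed by transform on the same data.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
same = np.allclose(X_train_scaled, X_train_scaled_2)

print(scaler.mean_)
print(scaler.scale_)
print(same)
```

Fitting only on the training data and reusing the fitted scaler on new data keeps both sets on the same scale.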

In [25]:
scaler=preprocessing.StandardScaler()
scaler

###### -If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well.
###### -RobustScaler() does not remove outliers. It centers each feature on its median and scales it by the interquartile range, so the outliers stay in the data but have little influence on the scaling statistics.

In [26]:
model=scaler.fit(X_train)

In [27]:
model.transform(X_train)

array([[ 0.        , -1.33630621,  1.33630621],
       [ 1.22474487,  0.26726124, -0.26726124],
       [-1.22474487,  1.06904497, -1.06904497]])
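The output above can be checked by hand. The sketch below assumes the usual standardization formula z = (x - mean) / std, applied per column with the population standard deviation (NumPy's default `ddof=0`), and compares it with what StandardScaler returns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, -2, 2],
                    [2, 0, 0],
                    [0, 1, -1]], dtype=float)

# Manual standardization, column by column (axis=0).
# np.std uses ddof=0 by default, i.e. the population standard deviation.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

scaled = StandardScaler().fit_transform(X_train)

print(np.allclose(manual, scaled))  # the two results agree
print(scaled.mean(axis=0))          # each column has (approximately) zero mean
print(scaled.std(axis=0))           # each column has unit standard deviation
```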

#### 1.2 MinMaxScaler
###### -MinMaxScaler rescales each feature to a given range, [0, 1] by default.
###### -MaxAbsScaler is similar to MinMaxScaler, except that each feature is divided by its maximum absolute value, so the output range depends on the signs of the values present. If only positive values are present, the range is [0, 1]. If only negative values are present, the range is [-1, 0]. If both negative and positive values are present, the range is [-1, 1].
###### -On positive-only data, MinMaxScaler and MaxAbsScaler behave similarly. MaxAbsScaler therefore also suffers from the presence of large outliers.

In [29]:
preprocessing.MinMaxScaler().fit_transform(X_train)

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.66666667, 0.33333333],
       [0.        , 1.        , 0.        ]])
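To make the difference between the two scalers concrete, the sketch below applies both to the same `X_train` and compares them with hand-written formulas. The formulas are the standard definitions, written out here as an illustration: (x - min) / (max - min) for MinMaxScaler with its default range, and x / max(|x|) for MaxAbsScaler.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X_train = np.array([[1, -2, 2],
                    [2, 0, 0],
                    [0, 1, -1]], dtype=float)

min_max = MinMaxScaler().fit_transform(X_train)
max_abs = MaxAbsScaler().fit_transform(X_train)

# Hand-written equivalents, computed per column (axis=0).
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual_min_max = (X_train - col_min) / (col_max - col_min)
manual_max_abs = X_train / np.abs(X_train).max(axis=0)

print(min_max)  # every column spans [0, 1]
print(max_abs)  # columns with mixed signs stay within [-1, 1]; zeros stay zero
```

Because MaxAbsScaler only divides and never shifts, zero entries remain zero, whereas MinMaxScaler moves each column's minimum to 0.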

#### 1.3 RobustScaler


In [30]:
preprocessing.RobustScaler().fit_transform(X_train)

array([[ 0.        , -1.33333333,  1.33333333],
       [ 1.        ,  0.        ,  0.        ],
       [-1.        ,  0.66666667, -0.66666667]])

###### -Unlike the previous scalers, the centering and scaling statistics of RobustScaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers.
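The effect of a single large outlier can be illustrated with a small made-up column. The data below is hypothetical and not from the original notebook; it only serves to compare how StandardScaler and RobustScaler treat the four ordinary values when a fifth value is extreme.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single-feature data: four ordinary values and one large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)  # default: median and 25th-75th percentile range

# Spread (max - min) of the four ordinary values after scaling.
standard_spread = standard[:4, 0].max() - standard[:4, 0].min()
robust_spread = robust[:4, 0].max() - robust[:4, 0].min()

print(standard[:4, 0])  # squeezed into a very narrow band by the outlier
print(robust[:4, 0])    # still spread out around the median
print(standard_spread, robust_spread)
```

In this sketch the outlier inflates the mean and standard deviation, so StandardScaler compresses the ordinary values, while the median and interquartile range used by RobustScaler are barely affected. The outlier itself is still present in both outputs.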

#### 1.4 Encoding categorical values

In [32]:
x=[['male','from Us','uses chrome'],['female','from India','uses edge']]

In [38]:
encoder=preprocessing.OrdinalEncoder()

In [39]:
encoder.fit(x)

In [41]:
encoder.transform([['female','from Us','uses edge']])

array([[0., 1., 1.]])
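The codes in the output above come from the categories learned during `fit`. As a minimal sketch, the fitted encoder's `categories_` attribute shows the per-column category order (sorted by default), and `inverse_transform` maps the codes back to the original labels.

```python
from sklearn.preprocessing import OrdinalEncoder

x = [['male', 'from Us', 'uses chrome'],
     ['female', 'from India', 'uses edge']]

encoder = OrdinalEncoder().fit(x)

# One array per column; a value's code is its index in that array.
print(encoder.categories_)

codes = encoder.transform([['female', 'from Us', 'uses edge']])
print(codes)

# Map the numeric codes back to the original category labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

The integer codes imply an ordering between categories, which is only meaningful when the categories are actually ordered.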