
# Standardization

Many learning algorithms assume that the features are centered around zero and have variances of the same order. If one feature has a much larger variance than the others, it can dominate the objective function and the estimator may not predict correctly.

The scikit-learn library provides the StandardScaler class in its sklearn.preprocessing module to standardize a data set.



In [1]:
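# Before fitting an estimator, we should always apply some preprocessing such as scaling.
# Feature Scaling
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative arrays; the original cell did not define X_train and X_test.
X_train = np.array([[1., 0., 2.], [2., 0., -1.], [0., 2., -1.]])
X_test = np.array([[1., 1., 0.]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean and std from the training data only
X_test = sc.transform(X_test)        # reuse the training statistics on the test data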
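To make the idea of standardization concrete, the following sketch compares StandardScaler with the manual computation (x - mean) / std for each column. The array X is illustrative data, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption for this sketch).
X = np.array([[1., 0., 2.], [2., 0., -1.], [0., 2., -1.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, every column has mean close to 0 and standard deviation close to 1, so no single feature dominates because of its units.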

# Scaling with sparse data and outliers

# 1- Scaling with Sparse data:

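Another way to scale data is to bring feature values into a fixed range. There are two scalers for this: MinMaxScaler, which scales each feature to a given range (0 to 1 by default), and MaxAbsScaler, which divides each feature by its maximum absolute value so that the values lie between -1 and 1. MaxAbsScaler does not shift or center the data, so zero entries stay zero, which makes it suitable for sparse data.

Example with Python: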

In [4]:
import numpy as np
X_train = np.array([[1., 0., 2.],
                    [2., 0., -1.],
                    [0., 2., -1.]])

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)


[[0.5 0.  1. ]
 [1.  0.  0. ]
 [0.  1.  0. ]]
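MaxAbsScaler is mentioned above but not shown. As a minimal sketch, assuming the same values stored as a SciPy CSR matrix (the variable names here are illustrative), it can be applied to sparse input like this:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Same values as X_train above, stored as a sparse CSR matrix.
X_sparse = sparse.csr_matrix([[1., 0., 2.],
                              [2., 0., -1.],
                              [0., 2., -1.]])

max_abs_scaler = MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_sparse)

# Each column is divided by its maximum absolute value (2 for every column here).
# The result is still sparse and the zero entries are unchanged.
print(sparse.issparse(X_maxabs))
print(X_maxabs.toarray())
```

Because no offset is subtracted, the sparsity pattern of the input is preserved, which would not be the case if the data were centered.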


# 2-Scaling with Outliers:

When raw data contain many outliers, scaling with the mean and variance does not work well, because outliers strongly influence both statistics. In that case we use a more robust method based on the median and the interquartile range (IQR). The IQR is the range between the 25th and 75th percentiles. RobustScaler removes the median and scales the data by this quantile range.

RobustScaler takes parameters that control the scaling:


- The first parameter is with_centering. If it is True, the data is centered (the median is subtracted) before scaling.

- The second parameter is with_scaling. If it is True, the data is scaled to the quantile range.


Example with Python:

In [5]:
from sklearn.preprocessing import RobustScaler
X = [[ 1., 0.,  2.], [ 2.,  0.,  -1.], [ 0.,  2., -1.]]
transformer = RobustScaler().fit(X)

transformer.transform(X)


array([[ 0.,  0.,  2.],
       [ 1.,  0.,  0.],
       [-1.,  2.,  0.]])
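The effect of the median and IQR is easier to see on data with an obvious outlier. The following sketch uses a single illustrative column (an assumption, not data from the notebook) and also shows what with_centering=False does.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a clear outlier (illustrative data).
X_out = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler()  # with_centering=True, with_scaling=True by default
X_robust = robust.fit_transform(X_out)

# The median is 3 and the IQR is 4 - 2 = 2, so each value becomes (x - 3) / 2.
print(robust.center_, robust.scale_)
print(X_robust.ravel())

# Without centering, the values are only divided by the IQR.
X_no_center = RobustScaler(with_centering=False).fit_transform(X_out)
print(X_no_center.ravel())
```

The outlier stays far from the rest, but it does not distort the scale of the other values, because the median and the IQR ignore the extremes.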

# Normalization

Normalization scales each individual sample (row) so that it has unit norm. This is different from MinMaxScaler, which rescales each feature (column) to a range. Normalization is useful when a quadratic form, such as the dot product or another kernel, is used to measure the similarity of pairs of samples.

It is also the basis of the vector space model, in which text samples are represented as vectors, as is common in text classification and clustering.

scikit-learn offers two ways to normalize, as shown below:

- normalize: A function that scales the input vectors to unit norm. The norm parameter selects the norm used to normalize each non-zero sample. It accepts three values, 'l1', 'l2', and 'max', where 'l2' is the default.

- Normalizer: A class that performs the same operation through the transformer API. Each sample is normalized independently, so the class is stateless and its fit method does not learn anything from the data.

Example with Python:



In [6]:
from sklearn.preprocessing import normalize
X = [[ 1., 0., 2.], [ 2., 0., -1.], [ 0., 2., -1.]]
X_normalized = normalize(X, norm='l2')
print(X_normalized)

[[ 0.4472136   0.          0.89442719]
 [ 0.89442719  0.         -0.4472136 ]
 [ 0.          0.89442719 -0.4472136 ]]


Example with Normalizer:

The Normalizer is useful as an early step in a data processing pipeline.

When we use sparse input, it is important to convert it to CSR format first to avoid unnecessary memory copies. CSR stands for Compressed Sparse Row and is available as scipy.sparse.csr_matrix.
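A minimal sketch of the Normalizer class on CSR input, reusing the values from the normalize example (the variable names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Normalizer

# Same values as the normalize example, stored in CSR format.
X_csr = csr_matrix([[1., 0., 2.], [2., 0., -1.], [0., 2., -1.]])

normalizer = Normalizer(norm='l2')
# fit learns nothing here, so fit_transform and transform give the same result.
X_csr_normalized = normalizer.fit_transform(X_csr)

# Every row now has Euclidean length 1.
row_norms = np.linalg.norm(X_csr_normalized.toarray(), axis=1)
print(X_csr_normalized.toarray())
print(row_norms)
```

The values match the output of normalize(X, norm='l2') above. The class form is convenient because it can be placed directly in a scikit-learn pipeline.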

# Categorical Encoding

A raw data set often has columns that do not hold continuous values but categories, either binary or with multiple levels. To turn them into numeric values we use encoding methods. Some encoding methods are given below:

- **Get Dummies:** It creates new feature columns of 0s and 1s, one per category, with the help of the pandas library.

- **Label Encoder:** It encodes category labels as integers from 0 to n_classes - 1 in the sklearn library. It is mainly intended for target labels.

- **One Hot Encoder:** The sklearn library provides another way to convert each category into a new feature column of 0s and 1s.

- **Hashing:** It maps categories to a fixed number of columns with a hash function. It is more practical than one-hot encoding when a feature has high cardinality, because one-hot encoding would create a very large number of columns.

There are many other encoding methods, such as **mean encoding, Helmert encoding, ordinal encoding**, and probability ratio encoding.

Example with Python:

In [9]:
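import pandas as pd
# Illustrative data; the original cell did not define df.
df = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'California']})
df1 = pd.get_dummies(df['State'], drop_first=True)  # drop_first removes one redundant column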
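The scikit-learn encoders from the list above can be sketched in the same way. The state names below are illustrative values, and n_features=8 for the hasher is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative categorical values (an assumption for this sketch).
states = ['New York', 'California', 'Florida', 'California']

# LabelEncoder: one integer per category, assigned in sorted order of the labels.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(states)
print(label_encoder.classes_, labels)

# OneHotEncoder: one 0/1 column per category; expects a 2D input.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(np.array(states).reshape(-1, 1)).toarray()
print(one_hot)

# FeatureHasher: a fixed number of output columns, whatever the cardinality.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[s] for s in states]).toarray()
print(hashed.shape)
```

With only three categories, one-hot encoding is the simpler choice. Hashing becomes attractive when the number of distinct categories is large, at the cost of possible collisions between categories.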

# Imputation

When raw data have missing values, replacing each missing entry with a numeric value is known as imputation.

First, create a random data frame with some missing rows.



In [10]:
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'c', 'e', 'h'],
                  columns=['First', 'Second', 'Three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)

      First    Second     Three
a -0.857874  0.914767  0.113066
b       NaN       NaN       NaN
c  0.671168 -0.706969 -0.362725
d       NaN       NaN       NaN
e  1.111266 -0.171088  0.257602
f       NaN       NaN       NaN
g       NaN       NaN       NaN
h -1.803207  0.442516  0.939384


Now replace the missing values with zero.

In [11]:
print("NaN replaced with '0':")
print(df.fillna(0))

NaN replaced with '0':
      First    Second     Three
a -0.857874  0.914767  0.113066
b  0.000000  0.000000  0.000000
c  0.671168 -0.706969 -0.362725
d  0.000000  0.000000  0.000000
e  1.111266 -0.171088  0.257602
f  0.000000  0.000000  0.000000
g  0.000000  0.000000  0.000000
h -1.803207  0.442516  0.939384


Replacing the missing values with the mean.

In [14]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')


The sklearn library provides SimpleImputer, which finds the NaN values and fills them with the column mean.

We can use the imputer as a step in a pipeline, so that missing values are filled before the estimator is fitted.
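The imputer above is created but never applied. As a minimal sketch, assuming a small illustrative data frame (not the random one above, so that the numbers are easy to check), it can be used on its own or inside a pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data frame with one missing value per column.
df_small = pd.DataFrame({'First': [1.0, np.nan, 3.0],
                         'Second': [4.0, 6.0, np.nan]})

# Fill each NaN with the mean of its column (2.0 for First, 5.0 for Second).
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imp.fit_transform(df_small)
print(filled)

# The same imputer as the first step of a pipeline, followed by scaling.
pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
scaled = pipeline.fit_transform(df_small)
print(scaled)
```

An estimator can be appended as the last step of the pipeline, so that imputation and scaling are fitted on the training data only and then reused on the test data.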

# Conclusion

Data preprocessing is an important step that makes the data set more reliable for our estimators.