# Feature Scaling

### Feature scaling is an important preprocessing step for many machine learning algorithms. It brings the features of the data onto a similar scale, which makes visualization easier and can speed up training (for example, for algorithms based on gradient descent).

### Loading Data1

In [29]:
import pandas as pd

In [30]:
data1 = pd.read_csv('data1.txt')
print(data1.head())

   size  no_rooms    cost
0  2104         3  399900
1  1600         3  329900
2  2400         3  369000
3  1416         2  232000
4  3000         4  539900


### Clearly, the size of the house (in sq. feet) and the number of rooms are not on a similar scale, so we perform feature scaling.
### The last column (cost) is the target value y, so we do not scale it.



In [31]:
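# Drop the target column; the keyword form works on current pandas (the positional axis argument was removed)
X = data1.drop(columns=['cost'])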
print(X.head())

   size  no_rooms
0  2104         3
1  1600         3
2  2400         3
3  1416         2
4  3000         4


In [32]:
import numpy as np
Xa = np.array(X)
print(Xa[:10])

[[2104    3]
 [1600    3]
 [2400    3]
 [1416    2]
 [3000    4]
 [1985    4]
 [1534    3]
 [1427    3]
 [1380    3]
 [1494    3]]


### SCALING DATA

In [33]:
from sklearn import preprocessing

In [34]:
X_scaled = preprocessing.scale(Xa)

In [35]:
print(X_scaled[:10])

[[ 0.13141542 -0.22609337]
 [-0.5096407  -0.22609337]
 [ 0.5079087  -0.22609337]
 [-0.74367706 -1.5543919 ]
 [ 1.27107075  1.10220517]
 [-0.01994505  1.10220517]
 [-0.59358852 -0.22609337]
 [-0.72968575 -0.22609337]
 [-0.78946678 -0.22609337]
 [-0.64446599 -0.22609337]]


### Getting Mean and Variance of scaled data

In [36]:
X_scaled.mean(axis=0)

array([9.44870659e-18, 2.71059770e-16])

In [37]:
X_scaled.std(axis=0)

array([1., 1.])

### From the above, each column of the scaled data has mean approximately 0 and standard deviation 1 (and therefore variance 1)
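
To make the scaling step concrete, the sketch below reproduces `preprocessing.scale` by hand on the first five rows shown earlier. Using only these five rows (rather than the full `data1.txt`) is an assumption made to keep the example self-contained, so the numbers differ from the full-data output above.

```python
import numpy as np
from sklearn import preprocessing

# First five rows of the size / no_rooms columns (illustrative subset, not the full file).
sample = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# The library call should produce the same values.
library = preprocessing.scale(sample)

print(np.allclose(manual, library))
print(manual.mean(axis=0), manual.std(axis=0))
```

The means print as tiny floating-point values rather than exact zeros, which is the same effect seen in the `X_scaled.mean(axis=0)` output above.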

### STANDARD SCALER

In [69]:
X_std_scaled = preprocessing.StandardScaler()
X_std_scaled.fit(Xa)

StandardScaler(copy=True, with_mean=True, with_std=True)

#### Getting Mean and Scale

In [71]:
print(X_std_scaled.mean_)

[2000.68085106    3.17021277]


In [72]:
print(X_std_scaled.scale_)

[7.86202619e+02 7.52842809e-01]


#### Displaying Scaled Values

In [73]:
X_std = X_std_scaled.transform(Xa)
print(X_std[:10])

[[ 0.13141542 -0.22609337]
 [-0.5096407  -0.22609337]
 [ 0.5079087  -0.22609337]
 [-0.74367706 -1.5543919 ]
 [ 1.27107075  1.10220517]
 [-0.01994505  1.10220517]
 [-0.59358852 -0.22609337]
 [-0.72968575 -0.22609337]
 [-0.78946678 -0.22609337]
 [-0.64446599 -0.22609337]]
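
A fitted `StandardScaler` stores `mean_` and `scale_`, so it can be reused on data it has not seen. The sketch below shows that `transform` is just `(x - mean_) / scale_`, and that `inverse_transform` undoes it. The five training rows and the `new_house` sample are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training subset (first five rows shown earlier).
train = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
# Hypothetical unseen sample to be scaled with the training statistics.
new_house = np.array([[1985, 4]], dtype=float)

scaler = StandardScaler().fit(train)

# transform() applies the statistics learned during fit().
by_hand = (new_house - scaler.mean_) / scaler.scale_
by_scaler = scaler.transform(new_house)

# inverse_transform() maps scaled values back to the original units.
restored = scaler.inverse_transform(by_scaler)

print(np.allclose(by_hand, by_scaler))
print(restored)
```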


### SCALING FEATURES TO A RANGE

In [74]:
X_min_max = preprocessing.MinMaxScaler()
X_min_max.fit(Xa)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [76]:
X_MinMax_scaled = X_min_max.transform(Xa)
X_MinMax_scaled[:10]

array([[0.34528406, 0.5       ],
       [0.20628792, 0.5       ],
       [0.42691671, 0.5       ],
       [0.1555433 , 0.25      ],
       [0.59238831, 0.75      ],
       [0.31246553, 0.75      ],
       [0.18808605, 0.5       ],
       [0.15857694, 0.5       ],
       [0.145615  , 0.5       ],
       [0.17705461, 0.5       ]])
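
With the default `feature_range=(0, 1)`, min-max scaling is `(x - min) / (max - min)` per column. The sketch below checks this against `MinMaxScaler` on the first five rows only (an assumption for self-containment, so the values differ from the full-data output above).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative subset (first five rows shown earlier).
sample = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)

col_min = sample.min(axis=0)
col_max = sample.max(axis=0)

# Manual min-max scaling to the range [0, 1], column by column.
manual = (sample - col_min) / (col_max - col_min)

library = MinMaxScaler().fit_transform(sample)

print(np.allclose(manual, library))
print(manual.min(axis=0), manual.max(axis=0))
```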

## NORMALIZATION

### Each sample (row) is rescaled so that its feature vector has norm 1

In [94]:
# L2 norm: divide each row by its Euclidean length
N1 = preprocessing.normalize(Xa, norm='l2')
print(N1[:10])


In [95]:
# L1 norm: divide each row by the sum of its absolute values, so each row sums to 1
N2 = preprocessing.normalize(Xa, norm='l1')
print(N2[:10])

[[0.99857617 0.00142383]
 [0.99812851 0.00187149]
 [0.99875156 0.00124844]
 [0.99858956 0.00141044]
 [0.99866844 0.00133156]
 [0.99798894 0.00201106]
 [0.99804815 0.00195185]
 [0.9979021  0.0020979 ]
 [0.9978308  0.0021692 ]
 [0.99799599 0.00200401]]


###### Verifying Normalization

In [96]:
temp = N1[0]
print(np.linalg.norm(temp))

0.9999999999999999


### So the feature vector of each data point now has norm 1 (i.e. each row of N1 is a unit vector under the L2 norm)
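
The two norms can be told apart by what equals 1 in each row: the Euclidean length for `'l2'`, the sum of absolute values for `'l1'`. The sketch below checks both on the first five rows (an illustrative subset).

```python
import numpy as np
from sklearn import preprocessing

# Illustrative subset (first five rows shown earlier).
sample = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)

l2 = preprocessing.normalize(sample, norm='l2')
l1 = preprocessing.normalize(sample, norm='l1')

# Under 'l2' every row has Euclidean length 1.
l2_lengths = np.linalg.norm(l2, axis=1)
# Under 'l1' the absolute values in every row sum to 1.
l1_sums = np.abs(l1).sum(axis=1)

print(l2_lengths)
print(l1_sums)
print(l1[0])
```

Because normalization works row by row, these rows do not depend on the rest of the dataset, unlike the column-wise scalers above.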

# HANDLING NON-NUMERIC DATA

### Sometimes we are given non-numeric features, but most algorithms work only with numeric features. We therefore encode the non-numeric features before using them in our algorithm.

In [38]:
data2 = pd.read_csv('data2.txt')
print(data2.head())

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40


## We can see that Genre is not numeric data

In [39]:
enc = preprocessing.LabelEncoder()

In [40]:
enc.fit(data2['Genre'])

LabelEncoder()

In [41]:
list(enc.classes_)

['Female', 'Male']

In [42]:
enc.transform(data2['Genre'])

array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1])

In [43]:
data2.head()

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40


In [44]:
data2['Genre'] = enc.transform(data2['Genre'])

In [45]:
data2.head()

   CustomerID  Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1      1   19                  15                      39
1           2      1   21                  15                      81
2           3      0   20                  16                       6
3           4      0   23                  16                      77
4           5      0   31                  17                      40


### Thus we see that Female has been encoded as 0 and Male as 1 (classes are numbered in sorted order). Now we can work with the data.
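
The mapping learned by `LabelEncoder` can be read off `classes_`, where the position of each class is its code. The sketch below builds that mapping and decodes codes back to strings with `inverse_transform`. The short `genres` list is an illustrative stand-in for the `Genre` column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the first five values of the Genre column.
genres = ['Male', 'Male', 'Female', 'Female', 'Female']

enc = LabelEncoder().fit(genres)

# classes_ is sorted; the index of each class is its integer code.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}

# inverse_transform turns integer codes back into the original strings.
decoded = enc.inverse_transform([1, 1, 0])

print(mapping)
print(list(decoded))
```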

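## However, LabelEncoder is intended for encoding target labels (y), not input features. For features, scikit-learn provides OrdinalEncoder and OneHotEncoder.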

In [47]:
data3 = pd.read_csv('data2.txt')
data3.head()

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40


### ORDINAL ENCODER

In [48]:
enc2 = preprocessing.OrdinalEncoder()

In [51]:
data3 = np.array(data3)
data3[:5]

array([[1, 'Male', 19, 15, 39],
       [2, 'Male', 21, 15, 81],
       [3, 'Female', 20, 16, 6],
       [4, 'Female', 23, 16, 77],
       [5, 'Female', 31, 17, 40]], dtype=object)

In [56]:
enc2.fit(data3)

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

In [97]:
enc2.transform(data3[:10])

array([[ 0.,  1.,  1.,  0., 30.],
       [ 1.,  1.,  3.,  0., 67.],
       [ 2.,  0.,  2.,  1.,  4.],
       [ 3.,  0.,  5.,  1., 64.],
       [ 4.,  0., 13.,  2., 31.],
       [ 5.,  0.,  4.,  2., 63.],
       [ 6.,  0., 17.,  3.,  4.],
       [ 7.,  0.,  5.,  3., 79.],
       [ 8.,  1., 44.,  4.,  1.],
       [ 9.,  0., 12.,  4., 59.]])

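## All the columns are now encoded. Note that because the encoder was fit on the whole array, the numeric columns were also treated as categories and replaced by category indices (for example, Age 19 became 1), so in practice we would usually encode only the non-numeric columns.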
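
One possible way to encode only the non-numeric column is to pass just that column to the encoder, which leaves the numeric columns untouched. The small frame below is built inline to mirror `data2.head()`. That is an assumption made to keep the sketch self-contained, and only three of the five columns are included.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hand-built frame mirroring the first rows of data2 (illustrative only).
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Genre': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Age': [19, 21, 20, 23, 31],
})

enc = OrdinalEncoder()

# OrdinalEncoder expects 2-D input, hence the double brackets; ravel() flattens the result.
df['Genre'] = enc.fit_transform(df[['Genre']]).ravel()

print(df)
print(enc.categories_)
```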

## ONE HOT ENCODER

In [58]:
enc3 = preprocessing.OneHotEncoder()

In [64]:
data3[:5]

array([[1, 'Male', 19, 15, 39],
       [2, 'Male', 21, 15, 81],
       [3, 'Female', 20, 16, 6],
       [4, 'Female', 23, 16, 77],
       [5, 'Female', 31, 17, 40]], dtype=object)

In [60]:
enc3.fit(data3)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=True)

In [67]:
enc3.transform(data3).toarray()

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
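
The output above is wide because every distinct value in every column, including the numeric ones, gets its own indicator column. A smaller sketch on the `Genre` values alone makes the encoding easier to read. The inline `genre` array is an illustrative stand-in for that column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in for the Genre column, shaped (n_samples, 1).
genre = np.array([['Male'], ['Male'], ['Female'], ['Female'], ['Female']])

enc = OneHotEncoder()

# The default output is a sparse matrix; toarray() makes it dense for printing.
onehot = enc.fit_transform(genre).toarray()

# categories_ gives the order of the indicator columns.
print(enc.categories_)
print(onehot)
```

Each row has exactly one 1, in the column of its category, so no artificial ordering between categories is introduced.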

## GENERATING POLYNOMIAL FEATURES

### Sometimes the features we have are not enough to fit the data well; a lack of features gives us a high-bias (underfitting) problem. To get a better fit, we can include polynomial features. We also have to be careful: if we add a large number of features, the algorithm can overfit the data.

In [81]:
D = np.arange(6).reshape(3,2)
D

array([[0, 1],
       [2, 3],
       [4, 5]])

In [82]:
from sklearn.preprocessing import PolynomialFeatures

In [83]:
poly  = PolynomialFeatures(2)

### Since we chose degree 2, each sample is expanded as follows:
#### [x,y] -----> [1,x,y,x^2,xy,y^2]

In [84]:
features  = poly.fit_transform(D)
print(features)

[[ 1.  0.  1.  0.  0.  1.]
 [ 1.  2.  3.  4.  6.  9.]
 [ 1.  4.  5. 16. 20. 25.]]
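
The mapping `[x, y] -> [1, x, y, x^2, xy, y^2]` can be checked by building the same columns by hand. This is a minimal sketch using the same array `D` as above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

D = np.arange(6).reshape(3, 2)
x, y = D[:, 0], D[:, 1]

# Build the degree-2 terms by hand, in the order 1, x, y, x^2, xy, y^2.
manual = np.column_stack([np.ones(len(D)), x, y, x**2, x * y, y**2])

library = PolynomialFeatures(2).fit_transform(D)

print(np.allclose(manual, library))
```

For two input features, degree 2 already gives six columns. The count grows quickly with the degree and the number of features, which is where the overfitting risk mentioned earlier comes from.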


## Preprocessing covers much more than the techniques above, and a lot has not been covered here. See the scikit-learn website and its documentation for more.