Note: *Handling missing data with an Imputer and Pipeline preprocessing is covered in the "Data Manipulation repo", in the Missing Values notebook.*

### Why scale data?

- Many models (k-NN, for example) use some form of distance between data points to make predictions
- Features on larger scales can dominate that distance and bias the model toward them, regardless of how informative they are
- This is why we want the features to be **on a similar scale**
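
To make the point about distances concrete, here is a small sketch with made-up data (the `X_demo` values and the age/income interpretation are assumptions for illustration, not taken from any dataset in these notes). It compares Euclidean distances before and after standardizing each column.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X_demo = np.array([
    [25.0, 50_000.0],   # A
    [60.0, 50_500.0],   # B: very different age, almost the same income as A
    [26.0, 52_000.0],   # C: almost the same age, slightly different income
    [45.0, 30_000.0],
    [35.0, 80_000.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances are driven almost entirely by the income column.
d_ab_raw = euclidean(X_demo[0], X_demo[1])
d_ac_raw = euclidean(X_demo[0], X_demo[2])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
d_ab_std = euclidean(X_std[0], X_std[1])
d_ac_std = euclidean(X_std[0], X_std[2])

print("raw:    d(A,B) = {:.1f}, d(A,C) = {:.1f}".format(d_ab_raw, d_ac_raw))
print("scaled: d(A,B) = {:.3f}, d(A,C) = {:.3f}".format(d_ab_std, d_ac_std))
```

On the raw data, B looks closer to A than C does, because a 35-year age gap is tiny next to a 2,000-dollar income gap. After standardizing, C becomes the nearer point, which is what a distance-based model such as k-NN would then see.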

### Normalizing (Scaling and Centering)

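- **Standardization**
    - Subtract the mean and divide by the standard deviation, so all features are centered around zero and have a variance of 1
- Can also subtract the minimum and divide by the range, so the data ranges from zero to 1
- Or rescale so the data ranges from -1 to 1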
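
The three options above could be sketched with scikit-learn as follows. The toy array `X_demo` is an assumption for illustration, and `MinMaxScaler` with `feature_range=(-1, 1)` is just one way to get a -1 to 1 range; the notes do not say which tool was intended.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Standardization: each column ends up with mean 0 and variance 1.
X_standard = StandardScaler().fit_transform(X_demo)

# Min-max scaling: each column ends up in the range [0, 1].
X_01 = MinMaxScaler().fit_transform(X_demo)

# Min-max scaling to a custom range: each column ends up in [-1, 1].
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_demo)

print(X_standard.mean(axis=0), X_standard.std(axis=0))
print(X_01.min(axis=0), X_01.max(axis=0))
print(X_pm1.min(axis=0), X_pm1.max(axis=0))
```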

In [None]:
import numpy as np
from sklearn.preprocessing import scale

# Standardize every feature (column) of X to mean 0 and variance 1
X_scaled = scale(X)

In [None]:
np.mean(X), np.std(X)

In [None]:
np.mean(X_scaled), np.std(X_scaled)
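
`np.mean` and `np.std` without an `axis` argument summarize the whole array in a single number. Since `scale` works column by column, a per-feature check with `axis=0` can be more informative. A minimal sketch on synthetic data (the `X_demo` array is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data: two features with very different means and spreads.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(5, 2, 200),
    rng.normal(1000, 300, 200),
])

X_demo_scaled = scale(X_demo)

# Per-feature statistics: one value per column.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)

print("Column means before scaling:", X_demo.mean(axis=0))
print("Column means after scaling: ", col_means)
print("Column stds after scaling:  ", col_stds)
```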

#### Putting a Scaler in a Pipeline Object

In [None]:
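from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler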

steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size = 0.2, 
                                                   random_state = 21)

knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
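
To show what the pipeline does under the hood, the sketch below compares it with the equivalent manual steps: fit the scaler on the training set only, then reuse those training statistics to transform the test set. The iris dataset and the `_demo` variable names are assumptions chosen so the example runs on its own; they are not the data used in these notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset so the example is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21)

# Pipeline version: scaling and classification in one object.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_tr, y_tr)
pred_pipeline = pipe.predict(X_te)

# Manual version: the scaler learns mean and std from the training set only,
# and the same statistics are applied to the test set.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier().fit(scaler.transform(X_tr), y_tr)
pred_manual = knn.predict(scaler.transform(X_te))

same = np.array_equal(pred_pipeline, pred_manual)
print("Pipeline and manual predictions match:", same)
```

The pipeline is mainly a convenience: it keeps the test set out of the scaler's fit without any extra bookkeeping.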

#### CV and Scaling in a Pipeline

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

# Keys use the form '<step name>__<parameter name>' and must be strings
parameters = {'knn__n_neighbors': np.arange(1, 50)}

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size = 0.2, 
                                                   random_state = 21)

cv = GridSearchCV(pipeline, param_grid = parameters)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

print(cv.best_params_)
print(cv.score(X_test, y_test))
print(classification_report(y_test, y_pred))
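
The `'<step name>__<parameter name>'` naming used in the parameter grid can be seen end to end in the self-contained sketch below. The iris dataset, the smaller grid, and the `_demo` names are assumptions for illustration. Because the scaler sits inside the pipeline, it is refit on the training portion of each cross-validation fold, so the held-out fold does not influence the scaling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset so the example is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# 'knn' is the step name, 'n_neighbors' is the classifier's parameter.
param_grid = {'knn__n_neighbors': np.arange(1, 20)}

search = GridSearchCV(pipe, param_grid=param_grid)
search.fit(X_tr, y_tr)

# The refit best pipeline is available as best_estimator_.
best_knn = search.best_estimator_.named_steps['knn']
test_score = search.score(X_te, y_te)

print(search.best_params_)
print(best_knn.n_neighbors)
print(test_score)
```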

#### Scaling

In [None]:
# Import numpy and scale
import numpy as np
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

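# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler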

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
