# Feature engineering

## Centering data

Data centering involves subtracting the mean of a data set from each data point so that the new mean is 0. The centered values show how far above or below the mean each data point lies.

In [None]:
import numpy as np

# Find the mean of a dataset
original_data = data.column
mean_data = np.mean(original_data)

# Subtract the mean from every data point
centered_data = original_data - mean_data
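
As a quick check of the idea, here is a minimal sketch on a small made-up array. The values are hypothetical and simply stand in for `data.column`.

```python
import numpy as np

# Hypothetical toy values standing in for data.column
original_data = np.array([2.0, 4.0, 6.0, 8.0])

# The mean of the toy values is 5.0
mean_data = np.mean(original_data)

# Each centered value is the signed distance from the mean
centered_data = original_data - mean_data

print(centered_data)           # the values -3, -1, 1, 3
print(np.mean(centered_data))  # 0.0
```

Negative values sit below the original mean and positive values sit above it.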

## Standardising data

Standardisation (Z-score normalisation) puts all features on the same scale. This step is critical because many machine learning models are sensitive to the scale of their features, so a feature with a large range can dominate one with a small range. You’ll definitely want to standardize your data in the following situations:

* Before Principal Component Analysis
* Before using any clustering or distance-based algorithm (think KMeans or DBSCAN)
* Before KNN
* Before performing regularization methods like LASSO and Ridge
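
To see why scale matters for distance-based methods, here is a rough sketch with made-up numbers. The `customers` array and its two columns (income and age) are hypothetical and not part of the coffee data set. It measures how much of the squared Euclidean distance between two rows comes from the age column, before and after standardising.

```python
import numpy as np

# Hypothetical customers: [annual income in dollars, age in years]
customers = np.array([
    [50_000.0, 25.0],
    [50_500.0, 60.0],
    [58_000.0, 26.0],
])

# Squared per-feature differences between customer 0 and customer 1
diff_raw = (customers[0] - customers[1]) ** 2
age_share_raw = diff_raw[1] / diff_raw.sum()

# Standardise each column: subtract its mean, divide by its standard deviation
standardized = (customers - customers.mean(axis=0)) / customers.std(axis=0)
diff_std = (standardized[0] - standardized[1]) ** 2
age_share_std = diff_std[1] / diff_std.sum()

print(f"Share of squared distance due to age (raw): {age_share_raw:.3f}")
print(f"Share of squared distance due to age (standardised): {age_share_std:.3f}")
```

On the raw data the 35-year age gap barely registers next to a 500 dollar income gap. After standardising, the age gap accounts for most of the distance.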

To standardise we center our data, then divide it by the standard deviation.

z = (value - mean) / stdev

In [None]:
distance = coffee['nearest_starbucks']

#find the mean of our feature
distance_mean = np.mean(distance)

#find the standard deviation of our feature
distance_std_dev = np.std(distance)
    
#take each data point in distance, subtract the mean, then divide by the standard deviation
distance_standardized = (distance - distance_mean) / distance_std_dev

# print what type distance_standardized is
print(type(distance_standardized))
#output = <class 'pandas.core.series.Series'>

#print the mean
print(np.mean(distance_standardized))
#output = 7.644158530205996e-17

#print the standard deviation
print(np.std(distance_standardized))
#output = 1.0000000000000013

# Our outputs are basically mean = 0 and standard deviation = 1
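
One detail worth knowing when reproducing these numbers: `np.std` uses the population formula (`ddof=0`) by default, while the pandas `.std()` method defaults to the sample formula (`ddof=1`). The sketch below uses a small hypothetical Series, not the coffee data, to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical distances, used only to compare the two defaults
distance = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

population_std = np.std(distance)  # numpy default: ddof=0
sample_std = distance.std()        # pandas default: ddof=1

print(population_std)
print(sample_std)

# Passing ddof=0 to pandas reproduces the numpy result
print(np.isclose(distance.std(ddof=0), population_std))  # True
```

Whichever formula you choose, use the same one when you standardise and when you check the result, otherwise the standard deviation of the standardised feature will not come out as 1.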

## Standardising with Sklearn

We instantiate StandardScaler and assign it to a variable called scaler, which we can then use to transform our feature. The next step is to reshape our distance array. StandardScaler expects a 2D array with one column per feature, so we reshape our distance array with the .reshape(-1,1) method. The second value, 1, tells numpy to give the data back as a single column. The -1 asks numpy to work out the number of rows from our data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

reshaped_distance = np.array(distance).reshape(-1,1)

distance_scaler = scaler.fit_transform(reshaped_distance)

In [None]:
print(np.mean(distance_scaler))
#output = -9.464196275493137e-17
print(np.std(distance_scaler))
#output = 0.9999999999999997
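
The following sketch, on a hypothetical four-value array rather than the coffee data, shows what `.reshape(-1, 1)` does to the shape and checks that `StandardScaler` gives the same values as the manual z-score formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical distances standing in for coffee['nearest_starbucks']
distance = np.array([1.0, 3.0, 5.0, 8.0])

# A 1D array of 4 values becomes a 2D array with 4 rows and 1 column
reshaped = distance.reshape(-1, 1)
print(distance.shape)  # (4,)
print(reshaped.shape)  # (4, 1)

scaled = StandardScaler().fit_transform(reshaped)

# Manual z-score using the population standard deviation (numpy default)
manual = (distance - distance.mean()) / distance.std()

print(np.allclose(scaled.ravel(), manual))  # True
```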

## Min-max normalisation

Another form of scaling your data is min-max normalization. The name says it all: we find the minimum and maximum data points in our data set and map them to 0 and 1, respectively. Every other data point is transformed to a number between 0 and 1, depending on where it lies between the minimum and the maximum. We find the transformed number by subtracting the minimum from the data point, then dividing by the maximum minus the minimum.

Mathematically a min-max normalization looks like this:

Xnorm = (X - Xmin) / (Xmax - Xmin)

One thing to note about min-max normalization is that it does not work well with data that has extreme outliers, because a single extreme value stretches the range and squeezes the remaining points into a narrow band. Min-max normalization is a better fit when the gap between your minimum and maximum is not too drastic.

The reason we would want to normalize our data is very similar to why we would want to standardize our data - getting everything on the same scale.
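
To illustrate the point about outliers, here is a small sketch with made-up values. The helper `min_max` is a hypothetical name that just applies the formula above.

```python
import numpy as np

def min_max(x):
    """Apply (x - min) / (max - min) to a 1D array."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical feature values, without and with one extreme outlier
typical = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Evenly spread between 0 and 1
print(min_max(typical))

# The first five values are now squeezed into a narrow band near 0
print(min_max(with_outlier))
```

With the outlier present, the five ordinary values all land below 0.05, so most of the 0 to 1 range is spent on a single point.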

In [None]:
distance = coffee['nearest_starbucks']

#find the min value in our feature
distance_min = np.min(distance)

#find the max value in our feature
distance_max = np.max(distance)

#normalize our feature by following the formula
distance_normalized = (distance - distance_min) / (distance_max - distance_min)

In [None]:
# With sklearn

from sklearn.preprocessing import MinMaxScaler

mmscaler = MinMaxScaler()

#get our distance feature
distance = coffee['nearest_starbucks']

#reshape our array to prepare it for the mmscaler
reshaped_distance = np.array(distance).reshape(-1,1)

#.fit_transform our reshaped data
distance_norm = mmscaler.fit_transform(reshaped_distance)

#see unique values
print(set(np.unique(distance_norm)))
#output = {0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0}

## Binning Data

Binning data is the process of taking numerical or categorical data and breaking it up into groups. We could decide to bin our data to help capture patterns in noisy data. There isn’t a hard and fast rule about how to bin your data, but like so many things in machine learning, you need to be aware of the trade-offs.

You want to make sure that your bin ranges aren’t so small that your model is still seeing it as noisy data. Then you also want to make sure that the bin ranges are not so large that your model is unable to pick up on any pattern. It is a delicate decision to make and will depend on the data you are working with.


For example, our data has a range of 0 km to 8 km. Let's see how our data transforms if we bin it in the following way:

distance < 1 km

1 km <= distance < 3 km

3 km <= distance < 5 km

5 km <= distance

In [None]:
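import pandas as pd

# Bin edges; right=False makes each bin include its lower edge and exclude its upper edge
bins = [0, 1, 3, 5, 8.1]

coffee['binned_distance'] = pd.cut(coffee['nearest_starbucks'], bins, right=False)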


We have 8.1 and not 8 as the last bin edge because we call pd.cut() with right = False, so each bin includes its lower bound and excludes its upper bound. With an upper edge of exactly 8, a distance of 8 km would fall outside the last bin.
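
The boundary behaviour can be checked on a few made-up values that sit exactly on the bin edges. The `distances` Series below is hypothetical and only exists for this illustration.

```python
import pandas as pd

# Hypothetical distances that sit exactly on the bin edges
distances = pd.Series([0, 1, 3, 5, 8])

# right=False: each bin is closed on the left and open on the right
with_margin = pd.cut(distances, [0, 1, 3, 5, 8.1], right=False)
without_margin = pd.cut(distances, [0, 1, 3, 5, 8], right=False)

# Every value gets a bin when the last edge is 8.1
print(with_margin.tolist())

# With a last edge of 8, the value 8 is excluded and becomes NaN
print(without_margin.isna().tolist())  # [False, False, False, False, True]
```

A value of 1 lands in the bin that starts at 1, not the one that ends at 1, which is what including the lower bound and excluding the upper bound means in practice.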

In [None]:
import matplotlib.pyplot as plt

print(coffee[['binned_distance', 'nearest_starbucks']].head(3))

#output
#  binned_distance  nearest_starbucks
#0      [5.0, 8.1)                  8
#1      [5.0, 8.1)                  8
#2      [5.0, 8.1)                  8

# Plot the bar graph of binned distances
coffee['binned_distance'].value_counts().plot(kind='bar')
 
# Label the bar graph 
plt.title('Starbucks Distance Distribution')
plt.xlabel('Distance')
plt.ylabel('Count') 
 
# Show the bar graph 
plt.show()

## Natural Log Transformation

Logarithms are an essential tool in statistical analysis and machine learning preparation. The log transformation works well for right-skewed data and data with large outliers. One large benefit is that it brings the data closer to a “normal” distribution. It also changes the scale, drastically reducing the range of values our data points span.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

#perform the log transformation
log_car = np.log(cars['odometer'])

#graph our transformation
plt.hist(log_car, bins = 200, color = 'g')

#rotate the x labels so we can read them easily
plt.xticks(rotation = 45)

#provide a title
plt.title('Logarithm of Car Odometers')
plt.show();
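
To put rough numbers on the effect, the sketch below generates right-skewed synthetic data and compares it before and after the transformation. The data is simulated with a fixed seed and is only a stand-in for `cars['odometer']`. The `skewness` helper is a hypothetical name for a simple moment-based estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for cars['odometer']
odometer = rng.lognormal(mean=11, sigma=1, size=1000)

def skewness(x):
    """Moment-based skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

log_odometer = np.log(odometer)

# Skewness drops towards 0 after the log transformation
print(skewness(odometer), skewness(log_odometer))

# The ratio of largest to smallest value shrinks drastically as well
print(odometer.max() / odometer.min(), log_odometer.max() / log_odometer.min())
```

A skewness near 0 indicates a roughly symmetric distribution, which is what the histogram of the log-transformed values should look like.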


Using a log transformation in a machine learning model requires some extra interpretation. For example, if you log transform your data in a linear regression model, the independent variable has a multiplicative relationship with the dependent variable, instead of the usual additive relationship you would have if the data were not log-transformed.
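
Here is a minimal sketch of that multiplicative interpretation. It assumes the dependent variable is the one being log-transformed, which is a choice made for this illustration, and it uses made-up data in which y grows by 50% for every unit of x.

```python
import numpy as np

# Hypothetical data: y = 2 * 1.5**x, so each unit of x multiplies y by 1.5
x = np.arange(10, dtype=float)
y = 2.0 * 1.5 ** x

# Fit a straight line to log(y): log(y) = intercept + slope * x
slope, intercept = np.polyfit(x, np.log(y), deg=1)

# On the original scale, one more unit of x multiplies y by exp(slope)
multiplier = np.exp(slope)

print(multiplier)         # approximately 1.5
print(np.exp(intercept))  # approximately 2.0
```

In an untransformed model the slope would be read as "add this much to y per unit of x". Here it is read as "multiply y by exp(slope) per unit of x".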

Keep in mind, just because your data is skewed does not mean that a log transformation is the best answer. You would not want to log transform your feature if:
1. You have values less than or equal to 0. The natural logarithm (which is what we’ve been talking about) of zero or a negative number is undefined.
2. You have left-skewed data. That data may call for a square or cube transformation.
3. You have non-parametric data.
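
The first point can be seen directly in numpy. The sketch below uses made-up values. The `np.log1p` step at the end is one possible workaround for non-negative data that contains zeros. It is an assumption added for illustration, not something the list above prescribes.

```python
import numpy as np

# Hypothetical values including a negative number and a zero
values = np.array([-5.0, 0.0, 1.0, 100.0])

# Silence the runtime warnings so the results can be inspected
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(values)

# nan for the negative value, -inf for zero
print(logged)

# Possible workaround for non-negative data with zeros: log(1 + x)
non_negative = values[values >= 0]
print(np.log1p(non_negative))
```

Note that `log(1 + x)` is a different transformation from `log(x)`, so any interpretation of the transformed feature has to account for the shift.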