<a href="https://colab.research.google.com/github/Gabrielsandbox/AI-ML-Codebase/blob/main/FeatureEngineering_Numerical_Transform.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# **Data centering**

In [None]:
# Data centering subtracts the mean of a feature from each data point so that the new mean is 0.
# Each centered value shows how far above or below the mean the original data point was.
#
#Example (assumes `coffee` is a DataFrame that has already been loaded):
distance = coffee['nearest_starbucks']

#get the mean of your feature
mean_dis = np.mean(distance)

#subtract mean_dis from every value in distance; the result is a new Series
centered_dis = distance - mean_dis

#visualize the centered Series
plt.hist(centered_dis, bins = 5, color = 'g')

#label our visual
plt.title('Starbucks Distance Data Centered')
plt.xlabel('Distance from Mean')
plt.ylabel('Count')
plt.show()
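
The centering step can be checked on a small, self-contained example. This is a minimal sketch: the values below are hypothetical and are not taken from the coffee data set.

```python
import numpy as np
import pandas as pd

# Hypothetical distances (not taken from the coffee data set)
distance = pd.Series([8.0, 8.0, 5.0, 3.0, 1.0, 0.5])

# Subtract the mean (4.25 here) from every value
centered = distance - distance.mean()

print(centered.tolist())
# [3.75, 3.75, 0.75, -1.25, -3.25, -3.75]

# The centered values balance out around zero, so their mean is zero
print(np.isclose(centered.mean(), 0.0))
```

Positive results sit above the original mean and negative results sit below it.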

# **Standardization (also known as Z-Score normalization)**

In [None]:
# Center our data, then divide it by the standard deviation.
# Once we do that, the standardized feature will have a mean of zero and a standard deviation of one.
# Applying this to every feature puts all of our features on the same scale.
#
#Example:

distance = coffee['nearest_starbucks']

#find the mean of our feature
distance_mean = np.mean(distance)

#find the standard deviation of our feature (np.std defaults to the population standard deviation, ddof=0)
distance_std_dev = np.std(distance)

#subtract the mean from each data point in distance, then divide by the standard deviation
distance_standardized = (distance - distance_mean) / distance_std_dev

# print what type distance_standardized is
print(type(distance_standardized))
#output = <class 'pandas.core.series.Series'>

#print the mean
print(np.mean(distance_standardized))
#output = 7.644158530205996e-17 = close to 0

#print the standard deviation
print(np.std(distance_standardized))
#output = 1.0000000000000013


# Note:
# This step is critical because many machine learning models are sensitive to the scale of their features:
# a feature with a larger numeric range can dominate the others regardless of how informative it is.
# You’ll usually want to standardize your data in the following situations:

# Before Principal Component Analysis
# Before using any clustering or distance-based algorithm (think KMeans or DBSCAN)
# Before KNN
# Before performing regularization methods like LASSO and Ridge

# **Data Standardization with Sklearn**

In [None]:
#Example:

from sklearn.preprocessing import StandardScaler

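scaler = StandardScaler()

#StandardScaler expects a 2-D array, so reshape the feature into a single column
reshaped_distance = np.array(distance).reshape(-1,1)

#fit the scaler and transform the data in one step
distance_scaler = scaler.fit_transform(reshaped_distance)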

print(np.mean(distance_scaler))
#output = -9.464196275493137e-17
print(np.std(distance_scaler))
#output = 0.9999999999999997
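
As a sanity check, the manual z-score and `StandardScaler` can be compared side by side. This is a minimal sketch on hypothetical values (the name `feature` is an assumption, not part of the coffee data set); it relies on `StandardScaler` using the population standard deviation, the same as `np.std` with its default `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for a numeric feature
feature = pd.Series([8.0, 8.0, 5.0, 3.0, 1.0, 0.5])

# Manual z-score with the population standard deviation (ddof=0)
manual = (feature - feature.mean()) / feature.std(ddof=0)

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
scaled = StandardScaler().fit_transform(feature.to_numpy().reshape(-1, 1))

# Both routes should agree to floating-point precision
print(np.allclose(manual.to_numpy(), scaled.ravel()))

# pandas' Series.std() defaults to the sample standard deviation (ddof=1),
# which is slightly larger than the population value
print(feature.std() > feature.std(ddof=0))
```

If a manual calculation uses `Series.std()` without `ddof=0`, its results will differ slightly from the scaler's output.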

# **Min-Max Normalization**

In [None]:
# We find the minimum and maximum data points in our feature and map them to 0 and 1, respectively.
# Every other data point is transformed to a number between 0 and 1,
# depending on where it falls between the minimum and the maximum.
#
# This transformation does not work well with data that has extreme outliers,
# because a single extreme value squeezes all the other points into a narrow band.
# You will want to perform a min-max normalization if the range between your min and max point is not too drastic.
#
#Example:

distance = coffee['nearest_starbucks']

#find the min value in our feature
distance_min = np.min(distance)

#find the max value in our feature
distance_max = np.max(distance)

#normalize our feature by following the formula
distance_normalized = (distance - distance_min) / (distance_max - distance_min)
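
To illustrate why extreme outliers are a problem for this formula, here is a minimal sketch on hypothetical values. The helper `min_max` is an illustrative name, not part of any library.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Hypothetical distances, with and without one extreme value
typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

print(min_max(typical))       # evenly spread: 0, 0.25, 0.5, 0.75, 1
print(min_max(with_outlier))  # the first four values all land below 0.04
```

With the outlier present, the four ordinary values become nearly indistinguishable. Note that this sketch does not guard against a constant feature, where the maximum equals the minimum and the division is undefined.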

# **Min-Max Normalization with Sklearn**

In [None]:
#Example:

from sklearn.preprocessing import MinMaxScaler

mmscaler = MinMaxScaler()

#get our distance feature
distance = coffee['nearest_starbucks']

#reshape our array to prepare it for the mmscaler
reshaped_distance = np.array(distance).reshape(-1,1)

#.fit_transform our reshaped data
distance_norm = mmscaler.fit_transform(reshaped_distance)

#see unique values
print(set(np.unique(distance_norm)))
#output = {0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0}

# **Data Binning**

In [None]:
# Binning data is the process of taking numerical or categorical data and breaking it up into groups.
# We could decide to bin our data to help capture patterns in noisy data.

# You want to make sure that your bin ranges aren’t so small that your model is still seeing it as noisy data.
# Then you also want to make sure that the bin ranges are not so large that your model is unable to pick up on any pattern.
# It is a delicate decision to make and will depend on the data you are working with.

#First, set the bin edges; with right = False each bin includes its left edge and excludes its right edge
bins = [0, 1, 3, 5, 8.1]

coffee['binned_distance'] = pd.cut(coffee['nearest_starbucks'], bins, right = False)

print(coffee[['binned_distance', 'nearest_starbucks']].head(3))

#output
#  binned_distance  nearest_starbucks
#0      [5.0, 8.1)                  8
#1      [5.0, 8.1)                  8
#2      [5.0, 8.1)                  8

# Plot the bar graph of binned distances (sort_index keeps the bars in bin order rather than by count)
coffee['binned_distance'].value_counts().sort_index().plot(kind='bar')

# Label the bar graph
plt.title('Starbucks Distance Distribution')
plt.xlabel('Distance')
plt.ylabel('Count')

# Show the bar graph
plt.show()
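
The edge behavior of `pd.cut` with `right = False` is easy to verify on a few values. This is a minimal sketch: the distances and the label names are hypothetical, chosen only to show which bin each boundary value falls into.

```python
import pandas as pd

# Hypothetical distances (not taken from the coffee data set)
distances = pd.Series([0.5, 1.0, 2.5, 3.0, 4.9, 8.0])

# Same bin edges as above; each bin is closed on the left and open on the right
bins = [0, 1, 3, 5, 8.1]
labels = ['under 1', '1 to <3', '3 to <5', '5 and over']  # hypothetical label names

binned = pd.cut(distances, bins, right=False, labels=labels)

# sort_index() lists the bins in their natural order rather than by frequency
counts = binned.value_counts().sort_index()
print(counts)
```

A value of exactly 1.0 lands in the second bin and exactly 3.0 in the third, because the left edge of each bin is inclusive.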

# **Natural Log Transformation**

In [None]:
# Logarithms are an essential tool in statistical analysis and machine learning preparation.
# This transformation works well for right-skewed data and data with large outliers.
# One large benefit of log transforming our data is that it brings the distribution closer to a “normal” distribution.
# It also compresses the scale, drastically reducing the range of the values.

# Example:

import numpy as np

#perform the log transformation (every odometer value must be greater than 0)
log_car = np.log(cars['odometer'])

#graph our transformation
plt.hist(log_car, bins = 200, color = 'g')

#rotate the x labels so we can read it easily
plt.xticks(rotation = 45)

#provide a title
plt.title('Logarithm of Car Odometers')
plt.show()
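
The effect on skew can also be measured directly. This is a minimal sketch on synthetic log-normal data, which stands in for the odometer readings and is an assumption rather than the `cars` data set. It also shows `np.log1p`, one possible option when a feature contains zeros.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed readings (log-normal), not the cars data set
rng = np.random.default_rng(0)
odometer = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

log_odometer = np.log(odometer)

# Skewness near 0 suggests a roughly symmetric distribution
print(odometer.skew())      # strongly positive before the transform
print(log_odometer.skew())  # close to 0 after the transform

# np.log returns -inf at 0 and NaN for negative values;
# np.log1p computes log(1 + x), so a reading of 0 maps to 0
print(np.log1p(0.0))
```

Whether `np.log1p` is appropriate depends on the feature, since it shifts every value by one before taking the logarithm.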

# Note:

'''
 When a histogram is right-skewed, the majority of our data is located on the left side of our graph.
 If we were to provide this feature to our machine learning model, it would see many data points with low readings on the left of our graph.
 It would not see a lot of examples with very high readings.
 This may cause issues with our model, as it may struggle to pick up on patterns within those examples on the right side of our histogram.
 So we log transform.

 Using a log transformation in a machine learning model will require some extra interpretation.
 For example, if you were to log transform your data in a linear regression model,
 your independent variable would have a multiplicative relationship with your dependent variable
 instead of the usual additive relationship it would have if your data were not log-transformed.
 Keep in mind, just because your data is skewed does not mean that a log transformation is the best answer.
 You would not want to log transform your feature if:
 1 - You have values less than or equal to 0. The natural logarithm (which is what we’ve been talking about) of zero or a negative number is undefined.
 2 - You have left-skewed data. That data may call for a square or cube transformation.
 3 - You have non-parametric data