# All about Feature Engineering

## 1. Feature scaling/normalizing
a. In scaling, you're changing the range of your data. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.

b. In normalization, you're changing the shape of the distribution of your data. In general, you'll normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
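As a rough illustration of the difference, the sketch below applies min-max scaling and a Box-Cox transform to the same synthetic, right-skewed sample. The data and variable names are made up for this example, and `MinMaxScaler` and `scipy.stats.boxcox` are just one possible choice of tools.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed data (an assumption for this demo)
original = rng.exponential(scale=2.0, size=1000)

# Scaling: squeeze the values into [0, 1]; the shape of the distribution is unchanged
scaled = MinMaxScaler().fit_transform(original.reshape(-1, 1)).ravel()

# Normalization: Box-Cox changes the shape toward a normal distribution
# (Box-Cox requires strictly positive input)
normalized, fitted_lambda = stats.boxcox(original)

print("skew original:  ", stats.skew(original))
print("skew scaled:    ", stats.skew(scaled))      # same skew, different range
print("skew normalized:", stats.skew(normalized))  # closer to 0
```

Scaling only changes the range, so the skew stays the same. The Box-Cox transform changes the shape, so the skew moves toward 0.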

Why do features need to be scaled or normalized?

a. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

b. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.

c. It's also important to apply feature scaling if regularization is used as part of the loss function (so that coefficients are penalized appropriately).
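The sketch below illustrates points (a) and (c) on made-up numbers. The two "people", the feature ranges, and the synthetic regression data are all assumptions chosen for the demo, not values from any dataset used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# (a) Distance: two hypothetical people described by income (dollars) and age (years)
p = np.array([50_000.0, 25.0])
q = np.array([51_000.0, 65.0])
raw_distance = np.linalg.norm(p - q)  # almost entirely the income gap

# Min-max scale each feature using ranges assumed for this example
mins = np.array([20_000.0, 18.0])
maxs = np.array([120_000.0, 90.0])
scaled_gap = np.abs((p - mins) / (maxs - mins) - (q - mins) / (maxs - mins))
print(raw_distance, scaled_gap)  # after scaling, the age gap is the larger component

# (c) Regularization: synthetic features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.1, 200)

# Standardizing inside the pipeline puts both coefficients on a comparable
# scale, so the Ridge penalty treats them evenly
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
coefs = model.named_steps["ridge"].coef_
print(coefs)
```

Putting the scaler inside a pipeline also means that, under cross-validation, it is fitted on the training folds only.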

## 2. Evaluate model using Cross-Validation

In [1]:
# Import libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [None]:
# Read input data
df = pd.read_csv("file.name")
X = df.copy()
y = X.pop("column.name")

In [None]:
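# Evaluate the model based on cross-validation scores
# Note: criterion="mae" was renamed to "absolute_error" in scikit-learn 1.0
model = RandomForestRegressor(criterion="absolute_error", random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
# The scorer returns negated MAE, so flip the sign after averaging the 5 fold scores
score = -1 * scores.mean()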

## 3. Apply Mutual Information to measure a relationship between two quantities

In [None]:
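# Import libraries
import matplotlib.pyplot as plt  # used by plot_mi_scores
import numpy as np               # used by plot_mi_scores
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression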


The scikit-learn algorithm for MI treats discrete features differently from continuous features.
Consequently, you need to tell it which are which. As a rule of thumb, anything that must have a float
dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.


In [None]:
# Function calculates MI scores
def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
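To see the same steps in action without a real dataset, here is a small self-contained sketch on synthetic data. The column names (`signal`, `noise`, `group`) and the quadratic target are invented for illustration. The label encoding and the `discrete_features` mask follow the approach of `make_mi_scores` above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.uniform(-3, 3, n),               # continuous, drives the target nonlinearly
    "noise": rng.normal(size=n),                   # continuous, unrelated to the target
    "group": rng.choice(["a", "b", "c"], size=n),  # categorical, unrelated to the target
})
y = X["signal"] ** 2 + rng.normal(scale=0.1, size=n)

# Label-encode the categorical column, then flag integer columns as discrete
X["group"], _ = X["group"].factorize()
discrete = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]

mi = pd.Series(
    mutual_info_regression(X, y, discrete_features=discrete, random_state=0),
    index=X.columns,
    name="MI Scores",
).sort_values(ascending=False)
print(mi)
```

The target depends on `signal` only through its square, a relationship that a linear correlation would largely miss. The MI score still ranks `signal` first.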

In [None]:
# Function for plotting MI scores
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

## 4. Mathematical Transforms

## Ratio
The "stroke ratio", for instance, is a measure of how efficient an engine is versus how performant:

In [None]:
autos["stroke_ratio"] = autos.stroke / autos.bore
autos[["stroke", "bore", "stroke_ratio"]].head()

## Normalize
Data visualization can suggest transformations, often a "reshaping" of a feature through powers or logarithms. The distribution of WindSpeed in US Accidents is highly skewed, for instance. In this case the logarithm is effective at normalizing it:

In [None]:
# If the feature has 0.0 values, use np.log1p (log(1+x)) instead of np.log
accidents["LogWindSpeed"] = accidents.WindSpeed.apply(np.log1p)

# Plot a comparison
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
# `fill=True` replaces the deprecated `shade=True` in recent seaborn versions
sns.kdeplot(accidents.WindSpeed, fill=True, ax=axs[0])
sns.kdeplot(accidents.LogWindSpeed, fill=True, ax=axs[1]);
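If you want a number to go with the plot, you can compare the skewness before and after the transform. The sketch below uses a synthetic log-normal sample as a stand-in for a skewed feature like WindSpeed. The data is made up, so the exact values will not match the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed, non-negative stand-in for a feature like WindSpeed
wind = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=2000))
log_wind = np.log1p(wind)  # log(1 + x), safe when the feature contains 0.0

print("skew before:", wind.skew())
print("skew after: ", log_wind.skew())
```

A skew closer to 0 after the transform suggests the logarithm has made the distribution more symmetric.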

## Counts
Features describing the presence or absence of something often come in sets, the set of risk factors for a disease, say. You can aggregate such features by creating a count.

These features will be binary (1 for Present, 0 for Absent) or boolean (True or False). In Python, booleans can be added up just as if they were integers.
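A tiny sketch of that idea, using hypothetical risk-factor flags for three patients (the column names and values are invented for the example):

```python
import pandas as pd

# Hypothetical boolean risk-factor flags
patients = pd.DataFrame({
    "Smoker": [True, False, True],
    "Hypertension": [True, False, False],
    "Diabetes": [True, False, True],
})

# Summing across columns treats True as 1 and False as 0
patients["RiskFactors"] = patients.sum(axis=1)
print(patients["RiskFactors"].tolist())  # [3, 0, 2]
```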

In the Traffic Accidents dataset there are several features indicating whether some roadway object was near the accident. The code below creates a count of the total number of roadway features nearby using the sum method:

In [None]:
roadway_features = ["Amenity", "Bump", "Crossing", "GiveWay",
    "Junction", "NoExit", "Railway", "Roundabout", "Station", "Stop",
    "TrafficCalming", "TrafficSignal"]
accidents["RoadwayFeatures"] = accidents[roadway_features].sum(axis=1) # Sum number of features (True = 1 and False = 0).

accidents[roadway_features + ["RoadwayFeatures"]].head(10)

You can also count values that satisfy a condition. In the concrete dataset, many formulations lack one or more components (that is, the component has a value of 0). The code below counts how many components are in a formulation with the dataframe's built-in greater-than method, gt:

In [None]:
components = [ "Cement", "BlastFurnaceSlag", "FlyAsh", "Water",
               "Superplasticizer", "CoarseAggregate", "FineAggregate"]
concrete["Components"] = concrete[components].gt(0).sum(axis=1) # Count the components whose amount is greater than 0.

concrete[components + ["Components"]].head(10)

## Building-Up and Breaking-Down Features

Complex strings can often be broken down into simpler pieces. From the Policy feature in the customer data, for example, we could separate the Type from the Level of coverage:

In [None]:
customer[["Type", "Level"]] = (  # Create two new features
    customer["Policy"]           # from the Policy feature
    .str                         # through the string accessor
    .split(" ", expand=True)     # by splitting on " "
                                 # and expanding the result into separate columns
)

customer[["Policy", "Type", "Level"]].head(10)

You could also join simple features into a composed feature if you had reason to believe there was some interaction in the combination:

In [None]:
autos["make_and_style"] = autos["make"] + "_" + autos["body_style"]
autos[["make", "body_style", "make_and_style"]].head()

Convert the date column to datetime by passing the format that matches how the dates are written. For example, 1/17/07 has the format "%m/%d/%y", and 17-1-2007 has the format "%d-%m-%Y":


In [None]:
# create a new column, date_parsed, with the parsed dates
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format="%m/%d/%y")
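The sketch below parses the two example formats on a couple of made-up date strings and checks that they resolve to the same dates. Once parsed, parts such as the day or month are available through the `.dt` accessor.

```python
import pandas as pd

# The same two dates written in two different styles (made-up sample values)
month_first = pd.Series(["1/17/07", "3/2/07"])
day_first = pd.Series(["17-1-2007", "2-3-2007"])

parsed_month_first = pd.to_datetime(month_first, format="%m/%d/%y")
parsed_day_first = pd.to_datetime(day_first, format="%d-%m-%Y")

print(parsed_month_first.dt.day.tolist())   # [17, 2]
print(parsed_day_first.dt.month.tolist())   # [1, 3]
```

Passing the wrong format to the day-first strings would either raise an error or silently swap day and month, so it is worth checking a few parsed values by eye.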

## Group Transforms
Finally we have Group transforms, which aggregate information across multiple rows grouped by some category.

In [None]:
customer["AverageIncome"] = (
    customer.groupby("State")  # for each state
    ["Income"]                 # select the income
    .transform("mean")         # and compute its mean
)

customer[["State", "Income", "AverageIncome"]].head(10)

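The mean function is a built-in dataframe method, so we can pass it to transform as a string. Other built-in methods include max, min, median, var, std, and count. Here is how you could calculate the frequency with which each state occurs in the dataset: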

In [None]:
customer["StateFreq"] = (
    customer.groupby("State")  # for each state
    ["State"]                  # select the State column
    .transform("count")        # count the rows belonging to that state
    / customer.State.count()   # and divide by the total number of rows
)

customer[["State", "StateFreq"]].head(10)
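The other aggregation methods work the same way. Here is a small sketch on a toy frame; `customer_demo` and its values are invented for the example and stand in for the real customer data.

```python
import pandas as pd

# Toy stand-in for the customer data
customer_demo = pd.DataFrame({
    "State": ["CA", "CA", "NV", "NV", "NV"],
    "Income": [50, 70, 30, 40, 80],
})

by_state = customer_demo.groupby("State")["Income"]
customer_demo["MaxIncome"] = by_state.transform("max")
customer_demo["MedianIncome"] = by_state.transform("median")
customer_demo["StateCount"] = by_state.transform("count")
print(customer_demo)
```

Unlike `agg`, `transform` returns one value per original row, so the result can be assigned straight back as a new column.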

If you're using training and validation splits, to preserve their independence, it's best to create a grouped feature using only the training set and then join it to the validation set. We can use the validation set's merge method after creating a unique set of values with drop_duplicates on the training set:

In [None]:
# Create splits
df_train = customer.sample(frac=0.5)
df_valid = customer.drop(df_train.index)

# Create the average claim amount by coverage type, on the training set
df_train["AverageClaim"] = df_train.groupby("Coverage")["ClaimAmount"].transform("mean")

# Merge the values into the validation set
df_valid = df_valid.merge(
    df_train[["Coverage", "AverageClaim"]].drop_duplicates(),
    on="Coverage",
    how="left",
)

df_valid[["Coverage", "AverageClaim"]].head(10)
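One thing to watch for with this pattern: a category that appears only in the validation set gets no match in the left merge and ends up as NaN. The toy example below (made-up frames and values) shows the effect and one possible fallback, filling with the overall training mean. That fallback is an assumption for the demo, not a recommendation from the original material.

```python
import pandas as pd

# Toy training and validation frames
train = pd.DataFrame({
    "Coverage": ["Basic", "Basic", "Premium"],
    "ClaimAmount": [100.0, 300.0, 500.0],
})
valid = pd.DataFrame({"Coverage": ["Basic", "Premium", "Extended"]})

# Build the grouped feature on the training set only
train["AverageClaim"] = train.groupby("Coverage")["ClaimAmount"].transform("mean")
lookup = train[["Coverage", "AverageClaim"]].drop_duplicates()

# "Extended" never appears in train, so its AverageClaim is NaN after the merge
valid = valid.merge(lookup, on="Coverage", how="left")
n_missing = valid["AverageClaim"].isna().sum()

# Assumed fallback: use the overall training mean for unseen categories
valid["AverageClaim"] = valid["AverageClaim"].fillna(train["ClaimAmount"].mean())
print(valid)
```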

## Tips on Creating Features
It's good to keep in mind your model's own strengths and weaknesses when creating features. Here are some guidelines:
1. Linear models learn sums and differences naturally, but can't learn anything more complex.
2. Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains.
3. Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.
4. Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.
5. Counts are especially helpful for tree models, since these models don't have a natural way of aggregating information across many features at once.
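As a quick check of guidelines 1 and 2, the sketch below fits a linear regression to a synthetic target that depends on the ratio of two features, with and without the ratio as an explicit column. The data is generated for the demo, and the scores are in-sample R², so treat them as an illustration rather than a benchmark.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.uniform(1, 5, 500)
b = rng.uniform(1, 5, 500)
y = a / b + rng.normal(0, 0.05, 500)  # the target depends on the ratio a / b

X_plain = np.column_stack([a, b])         # raw features only
X_ratio = np.column_stack([a, b, a / b])  # raw features plus the engineered ratio

r2_plain = LinearRegression().fit(X_plain, y).score(X_plain, y)
r2_ratio = LinearRegression().fit(X_ratio, y).score(X_ratio, y)
print(r2_plain, r2_ratio)
```

The linear model cannot form the ratio from `a` and `b` on its own, so adding it explicitly should raise the score noticeably.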