# Introduction #

After you've identified a good set of features to start developing, what kinds of things could you do with them then? That's what the remainder of this course is about. In this lesson you'll learn some transformations that can be done completely in dataframes. If you're feeling rusty, we've got a great [course on Pandas](https://www.kaggle.com/learn/pandas).

# Arithmetic Transforms #

For numeric data, arithmetic transforms are often useful, and domain knowledge is especially valuable when choosing them.

You can do computations over the columns of a dataframe just as if they were individual numbers. It's common to combine features with ordinary arithmetic operations.
```
df["Feature_1"] + df["Feature_2"]
df["Feature_1"] - df["Feature_2"]
df["Feature_1"] * df["Feature_2"]
df["Feature_1"] / df["Feature_2"]
```
Products and ratios are especially worth investigating if you have discovered any interaction effects. Ratios in particular are difficult for most machine learning models to discover on their own.
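
As a small illustration of a ratio feature, here is a minimal sketch on a made-up dataframe. The column names `Debt` and `Income` and the new `DebtToIncome` feature are assumptions for this example only. Turning zero denominators into missing values is one way to avoid infinite ratios.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names and values are made up for illustration.
df = pd.DataFrame({"Debt": [10.0, 20.0, 5.0], "Income": [50.0, 0.0, 25.0]})

# Replace zero denominators with NaN so the ratio is missing rather than infinite.
df["DebtToIncome"] = df["Debt"] / df["Income"].replace(0, np.nan)
print(df)
```

Whether a missing value or some other placeholder is the right choice for a zero denominator depends on your data.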

Power transforms and logarithms are common ways of transforming individual features.
```
np.log(df["Feature"])  # import numpy as np
df["Feature"] ** 2
```
These kinds of transforms are often applied to reduce skewness; check out our [lesson on normalization](https://www.kaggle.com/alexisbcook/scaling-and-normalization) in *Data Cleaning* where you can also learn about the *Box-Cox transformation*.
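
To see the effect on skewness, the following sketch compares the sample skew of a made-up, right-skewed feature before and after a log transform. It uses `np.log1p` (which computes `log(1 + x)`) instead of `np.log`, an assumption made here so that zeros would not cause problems.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (values are made up for illustration).
feature = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

# log1p computes log(1 + x), which stays defined when x is 0.
log_feature = np.log1p(feature)

print("skew before:", feature.skew())
print("skew after: ", log_feature.skew())
```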

You might want to count the number of features satisfying some property. Examples include a set of binary features indicating risk factors for some disease, or sensor recordings of neuronal activity where you want to count how many neurons pass some threshold.
```
df[["Binary_1", "Binary_2", "Binary_3"]].sum(axis=1)
df[["Numeric_1", "Numeric_2", "Numeric_3"]].gt(0).sum(axis=1)
```
Tree-based models (like random forests and XGBoost) don't have a natural way of integrating information across large numbers of features at once. Counting up properties across many features could be a good idea for these kinds of models.
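
Here is a minimal sketch of both kinds of count on a made-up dataframe. The column names and the `0.5` threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical risk-factor indicators and sensor readings.
df = pd.DataFrame({
    "Smoker": [1, 0, 1],
    "Diabetic": [0, 0, 1],
    "Hypertensive": [1, 0, 1],
    "Sensor_A": [0.2, 0.9, 0.7],
    "Sensor_B": [0.6, 0.1, 0.8],
})

# Count how many binary risk factors are present in each row.
df["RiskCount"] = df[["Smoker", "Diabetic", "Hypertensive"]].sum(axis=1)

# Count how many sensors exceed the (assumed) threshold in each row.
df["ActiveSensors"] = df[["Sensor_A", "Sensor_B"]].gt(0.5).sum(axis=1)
print(df[["RiskCount", "ActiveSensors"]])
```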

You can combine transformations to compute any formulas you might come across. If you had a dataframe containing the features `'AirTempF'` and `'WindSpdMPH'`, you could add a feature for the corresponding [wind-chill](https://www.weather.gov/oun/safety-winter-windchill) (in US units) like:
```
df["WindChill"] = (
    35.74
    + 0.6215 * df["AirTempF"]
    - 35.75 * df["WindSpdMPH"] ** 0.16
    + 0.4275 * df["AirTempF"] * df["WindSpdMPH"] ** 0.16
)
```
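
If you expect to reuse a formula like this, wrapping it in a function can help. The sketch below applies the same formula to a small made-up dataframe. The helper name `wind_chill_f` and the sample readings are assumptions for illustration.

```python
import pandas as pd


def wind_chill_f(air_temp_f, wind_speed_mph):
    """Wind chill in US units; accepts scalars or pandas Series."""
    v = wind_speed_mph ** 0.16
    return 35.74 + 0.6215 * air_temp_f - 35.75 * v + 0.4275 * air_temp_f * v


# Hypothetical readings (values are made up for illustration).
df = pd.DataFrame({"AirTempF": [0.0, 30.0], "WindSpdMPH": [15.0, 10.0]})
df["WindChill"] = wind_chill_f(df["AirTempF"], df["WindSpdMPH"])
print(df.round(1))
```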

Doing a bit of research about your problem domain during feature engineering can pay off with ideas for new features. If the experts agree some feature is useful, there's a good chance your model will find it useful too!

# Building Up and Breaking Down Features #

Categorical features often show up in a dataframe as strings. You can do column-wise operations on strings using Pandas `pd.Series.str` methods. (Note that the `str` methods only work with `Series` -- that is, single columns -- and not on entire dataframes.)

Often you'll have complex strings that can usefully be broken into simpler pieces. Common examples might be: 
- ID numbers: `'123-45-6789'`
- Phone numbers: `'(999) 555-0123'`
- Street addresses: `'8241 Kaggle Ln., Goose City, NV'`
- Internet addresses: `'http://www.kaggle.com'`
- Product codes: `'0 36000 29145 2'`
- Dates and times: `'Mon Sep 30 07:06:05 2013'`

Strings like these are usually structured in some logical way and can yield a surprising amount of information beyond what's in the string itself. As always, some research can pay off here.
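
As a rough sketch of what such a decomposition could look like, the code below pulls an area code out of the phone number and the hour and weekday out of the timestamp from the list above. The column names, the regular expression, and the date format string are assumptions chosen for this example.

```python
import pandas as pd

# Sample values taken from the list above; the column names are hypothetical.
df = pd.DataFrame({
    "Phone": ["(999) 555-0123"],
    "Timestamp": ["Mon Sep 30 07:06:05 2013"],
})

# Capture the three digits between parentheses as the area code.
df["AreaCode"] = df["Phone"].str.extract(r"\((\d{3})\)", expand=False)

# Parse the timestamp, then read off simpler date and time features.
ts = pd.to_datetime(df["Timestamp"], format="%a %b %d %H:%M:%S %Y")
df["Hour"] = ts.dt.hour
df["DayOfWeek"] = ts.dt.day_name()
print(df)
```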

Say we had a dataframe with instances like:

| Location                  | AccountNum |
|---------------------------|------------|
| "Los Angeles, California" | 123456     |

We could decompose these using Pandas' native string methods.

```
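# Split on the comma and the following space so "State" has no leading whitespace
df[["City", "State"]] = df["Location"].str.split(", ", expand=True)
# AccountNum is numeric, so convert it to a string before slicing
df["Branch"] = df["AccountNum"].astype(str).str.slice(stop=3)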

df.head()
```

You could also join simpler features to create a composed feature. This would help your model detect interactions between the two.

```
df["City"] + ", " + df["State"]
```

Similar tricks work with numeric data. Want to know if prices ending in $0.99 lead to more sales? You can split numbers into whole and fractional parts with the NumPy function `np.modf`. It returns a tuple with the fractional part first and the whole part second, so we'll create the features in that order.

```
df["Cents"], df["Dollars"] = np.modf(df["Price"])
```
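
To follow the $0.99 question through, here is a minimal sketch on made-up prices. The `Ends_99` column name is an assumption. The fractional part is rounded before comparing because floating-point arithmetic rarely yields exactly `0.99`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices (values are made up for illustration).
df = pd.DataFrame({"Price": [4.99, 10.00, 7.50]})

# np.modf returns (fractional part, whole part), in that order.
df["Cents"], df["Dollars"] = np.modf(df["Price"])

# Round to two decimals first so 0.9900000000000002 still counts as 0.99.
df["Ends_99"] = df["Cents"].round(2).eq(0.99)
print(df)
```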

# Group Transforms #

**Group transforms** aggregate information across multiple rows. If you have discovered a category interaction, a group transform over that category could be worth investigating.

It's common to do statistical aggregations group-wise. The first line computes the average income within each state. The second computes how different each person's income is from the average of the state they live in.
```
df["AverageIncome"] = df.groupby("State")["Income"].transform("mean")
df["DiffAvgIncome"] = df["Income"] - df["AverageIncome"]
```
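
It may help to see how `transform` differs from a plain aggregation. In this sketch on made-up data, the aggregation returns one value per state, while `transform` returns one value per row, which is what lets you assign the result back as a new column.

```python
import pandas as pd

# Hypothetical data (values are made up for illustration).
df = pd.DataFrame({
    "State": ["CA", "CA", "TX", "TX", "TX"],
    "Income": [50000, 70000, 40000, 50000, 60000],
})

# Plain aggregation: one value per group.
per_state = df.groupby("State")["Income"].mean()

# transform: one value per row, aligned with the original index.
df["AverageIncome"] = df.groupby("State")["Income"].transform("mean")
df["DiffAvgIncome"] = df["Income"] - df["AverageIncome"]

print(per_state)
print(df)
```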

Somewhat similar is computing the *frequency* of a feature's values. This would give the frequency of each state in the dataset.
```
df["StateFreq"] = df.groupby('State')['State'].transform('size') / len(df)
```
You could use this as an alternative to a one-hot or label encoding for a categorical feature.
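
The sketch below computes the frequency feature on made-up data and checks it against an equivalent formulation that uses `value_counts(normalize=True)` with `map`. Which form you prefer is a matter of taste. The variable name `alt` is just for this example.

```python
import pandas as pd

# Hypothetical data (values are made up for illustration).
df = pd.DataFrame({"State": ["CA", "CA", "TX", "NY"]})

# Group size divided by the number of rows gives each state's relative frequency.
df["StateFreq"] = df.groupby("State")["State"].transform("size") / len(df)

# Equivalent alternative: look up each state's share of the rows.
alt = df["State"].map(df["State"].value_counts(normalize=True))

print(df)
```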

If you're using training and validation splits, it's best to create a grouped feature using only the training set and then join it to the validation set. Using the entire set will create a dependency between the splits, which can throw off cross-validation.

This will join an `AverageIncome` feature created on a training set `df_train` to the validation set `df_valid`. (There should be, of course, only one average income per state.)
```
df_valid = pd.merge(
    df_valid,
    df_train[["State", "AverageIncome"]].drop_duplicates(),
    on="State",
    how="left",
)
```
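
Here is an end-to-end sketch of that workflow on made-up splits. Passing `validate="many_to_one"` to `merge` is an optional safeguard added for this example: pandas raises an error if the training-side table has more than one row per state.

```python
import pandas as pd

# Hypothetical training and validation splits.
df_train = pd.DataFrame({
    "State": ["CA", "CA", "TX"],
    "Income": [50000, 70000, 40000],
})
df_valid = pd.DataFrame({
    "State": ["TX", "CA", "CA"],
    "Income": [45000, 65000, 55000],
})

# Compute the grouped feature from the training rows only.
df_train["AverageIncome"] = df_train.groupby("State")["Income"].transform("mean")

# Join the training-set averages onto the validation rows.
df_valid = df_valid.merge(
    df_train[["State", "AverageIncome"]].drop_duplicates(),
    on="State",
    how="left",
    validate="many_to_one",  # fail loudly if a state has more than one average
)
print(df_valid)
```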

Group transforms are easiest to use when the grouping categories are known and relatively frequent. 

*Unknown categories*: Pandas will create missing values (`NaN`) for categories that aren't in the training set when you perform the `merge`. You could fill them in with the ungrouped transform.
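
One way to read "the ungrouped transform" is the same statistic computed over the whole training set. Under that assumption, a minimal sketch on made-up data could look like this, with the `state_means` and `overall_mean` names chosen for the example:

```python
import pandas as pd

# Hypothetical splits: "NY" appears only in the validation set.
df_train = pd.DataFrame({
    "State": ["CA", "CA", "TX"],
    "Income": [50000, 70000, 40000],
})
df_valid = pd.DataFrame({"State": ["TX", "NY"]})

# Per-state averages from the training set, as a two-column table.
state_means = (
    df_train.groupby("State")["Income"].mean().rename("AverageIncome").reset_index()
)
df_valid = df_valid.merge(state_means, on="State", how="left")

# Unknown states come back as NaN; fill them with the overall training mean.
overall_mean = df_train["Income"].mean()
df_valid["AverageIncome"] = df_valid["AverageIncome"].fillna(overall_mean)
print(df_valid)
```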

*Rare categories*: A mean computed from only two or three values isn't likely to be very accurate. Consider 
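
One option, offered here as an assumption rather than as the approach this lesson had in mind, is to blend each group's mean with the overall mean, weighting the group mean by the number of rows in the group. The smoothing weight `m` and all names below are hypothetical.

```python
import pandas as pd

# Hypothetical data: "WY" has only a single row.
df = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "WY"],
    "Income": [50000, 60000, 70000, 60000, 100000],
})

stats = df.groupby("State")["Income"].agg(["mean", "count"])
global_mean = df["Income"].mean()
m = 5  # assumed smoothing weight; larger values pull harder toward the global mean

# Groups with few rows are pulled strongly toward the global mean.
stats["Smoothed"] = (stats["count"] * stats["mean"] + m * global_mean) / (
    stats["count"] + m
)
df["SmoothedAvgIncome"] = df["State"].map(stats["Smoothed"])
print(stats)
```

With this blend, a state with many rows keeps a value close to its own mean, while a state with one or two rows lands near the overall mean.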


# Example - 1985 Automobiles #

Let's develop the features we chose in the last lesson. Here they are as a reminder.

```
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("../input/fe-course-data/autos.csv")
df.head()

top_features = [
    "curb_weight",
    "horsepower",
    "highway_mpg",
    "city_mpg",
    "width",
    "length",
    "wheel_base",
    "bore",
    "fuel_system",
]
print(top_features)
```

Let's do a drill-down and create our first set of derived features.

It can help to start with a bit of research.

The "displacement" of an automotive engine is the total volume swept by its pistons, and it serves as a rough measure of the engine's power.

```
df["displacement"] = (
    np.pi * ((0.5 * df.bore) ** 2) * df.stroke * df.num_of_cylinders
)
```

Similarly, the "stroke ratio" is a measure of how efficient an engine is versus how performant it is.

```
df["stroke_ratio"] = df.stroke / df.bore
```

Insurance companies sometimes define the "size class" of a vehicle in terms of its shadow and curb_weight.

```
df["shadow"] = df.length * df.width
df["size_class"] = df.shadow * df.curb_weight
```

A vehicle's "wheel base" can determine the smoothness of its ride. Luxury versions sometimes have an extended wheel base.

```
df["wheel_base_diff"] = df["wheel_base"] - df.groupby("make")[
    "wheel_base"
].transform("median")
```

The EPA defines the "combined fuel economy" of a vehicle as a weighted average of city and highway fuel economies, with the city value weighted by 55% and the highway value by 45%.

```
df["ave_mpg"] = 0.55 * df.city_mpg + 0.45 * df.highway_mpg
```

Combining features with common units can be fruitful.

```
df["mpg_ratio"] = df.highway_mpg / df.city_mpg
df["volume"] = df.length * df.width * df.height
```

Looking up the acronyms in the `fuel_system` feature we see they fall into categories, which we can decompose.

```
df.fuel_system
```

Sometimes a feature can be surprisingly *unimportant*. When that's the case, it's worth investigating why. For categoricals, it could be that one class predominates in the data, so the feature just doesn't contain much information. It could be worth dropping.

It's unlikely that all of these features will end up being important. But that's usually how it is.

Get familiar with your data, do some research, and you'll often be able to come up with features that improve your dataset quite a lot.

# Keep Going #