# Understanding Features
#### Definitions:
**Feature**: An individual, measurable attribute built on top of *raw data*, typically represented by a column in a dataset. Each *observation* is represented by a *row* and each *feature* by a *column*.
<br>
<br>
Features fall into two major types, depending on how they are obtained from the dataset:
<br>
* **Raw features**: obtained directly from the dataset with no extra data manipulation or engineering.
* **Derived features**: usually obtained through feature engineering, where we extract new features from existing data attributes. For example, creating a new feature *Age* from an employee dataset containing *Birthdate*.
<br>
<br> 
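
As a rough illustration of a derived feature, the sketch below computes *Age* from *Birthdate* on a small made-up employee table. The DataFrame, its values, and the fixed reference date are assumptions chosen for reproducibility, not part of any real dataset.

```python
import pandas as pd

# Hypothetical employee data; names and dates are invented for illustration
employees = pd.DataFrame({
    'Name': ['Ana', 'Ben'],
    'Birthdate': pd.to_datetime(['1990-06-15', '1985-12-01']),
})

# Fixed reference date (an assumption) so the result does not depend on today's date
reference_date = pd.Timestamp('2024-07-01')

# True where the birthday has already occurred in the reference year
birth_month = employees['Birthdate'].dt.month
birth_day = employees['Birthdate'].dt.day
had_birthday = (birth_month < reference_date.month) | (
    (birth_month == reference_date.month) & (birth_day <= reference_date.day)
)

# Age in whole years: subtract one year if the birthday is still to come
employees['Age'] = (
    reference_date.year
    - employees['Birthdate'].dt.year
    - (~had_birthday).astype(int)
)
```

Here *Birthdate* is the raw feature and *Age* is the derived one; a model would usually consume *Age* rather than the raw date.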

## Feature Engineering on Numeric Data
Numeric data typically takes the form of scalar values representing observations, recordings, and measurements. It can also be represented as a vector of values, where each value corresponds to a specific feature.


```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats
%matplotlib inline
```

#### Binarization 
We use binarization when we care more about whether a feature occurred at all than about how often it occurred (its count). You can convert a count feature into a new binary feature with syntax similar to this:
```
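# Copy the counts into a NumPy array so the original column is left unchanged
binary_label = np.array(df['feature_label'])
# Any count of 1 or more becomes 1; zeros stay 0
binary_label[binary_label >= 1] = 1
df['binary_label'] = binary_label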
```
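
To make the pattern above concrete, here is a minimal, self-contained sketch on a made-up table of listen counts. The DataFrame and the column names `listen_count` and `listened` are hypothetical and only stand in for `df` and `feature_label`.

```python
import numpy as np
import pandas as pd

# Hypothetical listen counts; the column names are illustrative only
songs = pd.DataFrame({'listen_count': [0, 3, 1, 0, 12]})

# np.array copies the column, so the original counts stay intact
listened = np.array(songs['listen_count'])

# Any count of 1 or more becomes 1; zeros stay 0
listened[listened >= 1] = 1
songs['listened'] = listened
```

Keeping the original count column next to the binary one lets you compare both representations later.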
<br>
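You can also use **scikit-learn's *Binarizer*** class from its *preprocessing* module to perform the same task instead of working with *NumPy* arrays directly. `Binarizer` maps values greater than the threshold to 1 and all other values to 0, so a threshold of 0.9 turns any count of 1 or more into 1.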

```
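from sklearn.preprocessing import Binarizer
# Values greater than the threshold become 1; all others become 0
bn = Binarizer(threshold=0.9)
# transform expects 2-D input, so wrap the column in a list and take the single output row
binary_label = bn.transform([df['feature_label']])[0]
df['binary_label'] = binary_label
df.head(11)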
```

#### Rounding
Often, when dealing with continuous numeric attributes like proportions or percentages, we do not need the full precision of the raw values. It then makes sense to round these high-precision values to integers, which can be used directly as numeric values or even as categorical (discrete-class) features. A good example is converting a popularity percentage (stored here as a fraction between 0 and 1) to both a scale out of 10 and a scale out of 100:
```
items_popularity = pd.read_csv('datasets/item_popularity.csv',
                               encoding='utf-8')

# Convert to scale out of 10
items_popularity['popularity_scale_10'] = np.array(
        np.round((items_popularity['pop_percent'] * 10)),
        dtype='int')

# Convert to scale out of 100
items_popularity['popularity_scale_100'] = np.array(
        np.round((items_popularity['pop_percent'] * 100)),
        dtype='int')
```
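
Since the CSV file may not be at hand, the following sketch applies the same rounding to a few made-up values. It assumes `pop_percent` holds fractions between 0 and 1, as the multiplications above suggest; the numbers themselves are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical popularity fractions in [0, 1], standing in for the CSV file
items = pd.DataFrame({'pop_percent': [0.98754, 0.21, 0.5123, 0.76]})

# Scale out of 10: 0.98754 -> 9.8754 -> 10
items['popularity_scale_10'] = np.round(items['pop_percent'] * 10).astype(int)

# Scale out of 100: 0.98754 -> 98.754 -> 99
items['popularity_scale_100'] = np.round(items['pop_percent'] * 100).astype(int)
```

Note that `np.round` rounds exact halves to the nearest even integer, which can matter when many values sit exactly on a .5 boundary.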

#### Interactions
Supervised ML models try to model the output responses (discrete classes or continuous values) as a function of the input feature variables. 
<br>A simple linear regression equation can be written as:
<br>$y = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$
<br>where the input features are depicted by the variables:
<br>$\{x_1, x_2, \dots, x_n\}$
<br>having weights or coefficients denoted by:
<br>$\{c_1, c_2, \dots, c_n\}$
<br>respectively, and the goal is to predict the response $y$.
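
The weighted sum described above can be sketched in a few lines of NumPy. The feature values and weights below are invented, and the names `x` and `c` are only illustrative labels for the inputs and their coefficients.

```python
import numpy as np

# Hypothetical feature values for a single observation
x = np.array([2.0, 0.5, 4.0])

# Hypothetical weights (coefficients), one per feature
c = np.array([1.5, -2.0, 0.25])

# The predicted response is the weighted sum of the features
y = np.dot(c, x)
```

In this form each feature contributes to the response independently of the others, through its own weight.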