# Feature Selection & Feature Engineering
### What is feature engineering and why do we need it?
- To create new features (columns) from the originally available data or from auxiliary data sources.
- To address data quality problems, i.e. data cleaning.
- To change the representation of the data in ways that are thought to improve learning.
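
As a minimal sketch of the first and third points, the cell below derives one new column from two existing ones and re-represents a date as a month number. The `orders` data and all of its column names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical order data; the column names are illustrative assumptions.
orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2021-01-15', '2021-02-03', '2021-02-20']),
    'total_price': [120.0, 45.0, 300.0],
    'quantity': [4, 3, 10],
})

# New feature created from two originally available columns.
orders['unit_price'] = orders['total_price'] / orders['quantity']

# Changed representation: keep only the month of the order date.
orders['order_month'] = orders['order_date'].dt.month

print(orders)
```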

### Data Quality Problems: What are they and how do we deal with them?

### 1. Disparate Scales
**Outcome**: Biased models, inaccurate results, and unstable parameter estimates.

**Solution**: Standardization
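
One way to see why disparate scales can bias a model is to look at a distance computation. The sketch below uses three hypothetical observations and a hypothetical helper, `first_feature_share`, to measure how much of the squared Euclidean distance between two points comes from the large-scale feature, before and after z-score standardization.

```python
import numpy as np

# Hypothetical observations: column 0 is in the hundreds, column 1 in the teens.
points = np.array([[569.0, 15.0],
                   [535.0, 17.0],
                   [356.0, 5.0]])

def first_feature_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to feature 0."""
    squared = (a - b) ** 2
    return squared[0] / squared.sum()

share_raw = first_feature_share(points[0], points[1])

# Standardize each column with the z-score (sample standard deviation, ddof=1).
standardized = (points - points.mean(axis=0)) / points.std(axis=0, ddof=1)
share_std = first_feature_share(standardized[0], standardized[1])

print(f"share of feature 0 before standardization: {share_raw:.3f}")
print(f"share of feature 0 after standardization:  {share_std:.3f}")
```

On the raw values, almost all of the distance comes from the first feature simply because its numbers are larger. After standardization, both features contribute on a comparable scale.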

In [1]:
# import
import pandas as pd
import numpy as np

In [2]:
# Create a sample data set with features on disparate scales
sample_df = pd.DataFrame({'v1': pd.Series(np.random.choice(1000,20)),
                          'v2': pd.Series(np.random.choice(20,20))})
sample_df.head()

|   | v1 | v2 |
|---|---|---|
| 0 | 569 | 15 |
| 1 | 535 | 17 |
| 2 | 356 | 5 |
| 3 | 538 | 2 |
| 4 | 419 | 6 |


In [3]:
# Standardize
sample_df_std = sample_df.copy()  # Work on a copy so this cell can be re-run without error

# Apply the z-score formula, (x - mean) / std, and store the result in a new column.
# Note: pandas .std() returns the sample standard deviation (ddof=1).
for col_name in sample_df.columns:
    new_col = col_name + '_std'
    sample_df_std[new_col] = (sample_df[col_name] - sample_df[col_name].mean()) / sample_df[col_name].std()

sample_df_std.head()

|   | v1 | v2 | v1_std | v2_std |
|---|---|---|---|---|
| 0 | 569 | 15 | 0.58054 | 0.84149 |
| 1 | 535 | 17 | 0.46192 | 1.139362 |
| 2 | 356 | 5 | -0.162579 | -0.647873 |
| 3 | 538 | 2 | 0.472387 | -1.094682 |
| 4 | 419 | 6 | 0.057217 | -0.498936 |
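
The per-column loop can also be written as one vectorized expression. The following is a sketch under a few assumptions: it builds its own data frame `df` with a fixed random seed (not the values shown in the output above), and it checks that every standardized column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible; the data is illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({'v1': rng.integers(0, 1000, size=20),
                   'v2': rng.integers(0, 20, size=20)})

# Vectorized z-score over all columns at once; pandas aligns on column names.
df_std = ((df - df.mean()) / df.std()).add_suffix('_std')

# Each standardized column should have mean ~0 and sample std ~1.
print(df_std.mean().round(6))
print(df_std.std().round(6))
```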


### 2. Noise & Outliers
**Solution 1**: Discretization

- Changing a numeric feature into an ordinal or nominal categorical feature based on value ranges, also referred to as binning.
- Commonly used in linear models to increase flexibility.
- Smooths complex signals in the training data, which **decreases overfitting**.
- Deals with [missing values](#3.-Missing-Values:-Uncollected-Information) or outliers.

In [4]:
# import
import pandas as pd
import numpy as np

In [5]:
# Create a sample data set
sample_df = pd.DataFrame({'v1': pd.Series(np.random.randn(20))})

sample_df.head()

|   | v1 |
|---|---|
| 0 | 2.407751 |
| 1 | -0.273877 |
| 2 | -0.571563 |
| 3 | 1.183492 |
| 4 | 1.203606 |


In [6]:
# Discretize: split the range of v1 into 5 equal-width bins
sample_df['v1_discret'] = pd.cut(sample_df['v1'], bins=5)

sample_df.head(8)

|   | v1 | v1_discret |
|---|---|---|
| 0 | 2.407751 | (1.528, 2.408] |
| 1 | -0.273877 | (-1.112, -0.232] |
| 2 | -0.571563 | (-1.112, -0.232] |
| 3 | 1.183492 | (0.648, 1.528] |
| 4 | 1.203606 | (0.648, 1.528] |
| 5 | 0.4218 | (-0.232, 0.648] |
| 6 | -0.727621 | (-1.112, -0.232] |
| 7 | -1.500019 | (-1.996, -1.112] |
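
`pd.cut` produces equal-width bins, so the number of rows per bin can vary a lot. As a sketch of two common variations, the cell below compares it with equal-frequency binning via `pd.qcut` and shows how to attach readable labels. The data is generated with a fixed seed and the label names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible; the data is illustrative only.
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(20), name='v1')

# Equal-width bins: every bin spans the same value range.
equal_width = pd.cut(values, bins=5)

# Equal-frequency bins: every bin holds (roughly) the same number of rows.
equal_freq = pd.qcut(values, q=5)

# Equal-width bins with hypothetical ordinal labels instead of intervals.
labels = ['very_low', 'low', 'mid', 'high', 'very_high']
labeled = pd.cut(values, bins=5, labels=labels)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
print(labeled.head())
```

Equal-frequency bins are often the safer choice when a feature has outliers, because a single extreme value cannot stretch the bin edges and leave most bins nearly empty.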


**Solution 2**: Winsorizing

- Replacing outliers in a feature with more central values of that feature, typically the nearest value that lies inside chosen lower and upper limits.

In [7]:
# Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize

In [8]:
# Create a sample data set
sample_df = pd.DataFrame({'v1': pd.Series(np.random.choice(1000,20))})

sample_df.head(10)

|   | v1 |
|---|---|
| 0 | 53 |
| 1 | 90 |
| 2 | 717 |
| 3 | 328 |
| 4 | 396 |
| 5 | 500 |
| 6 | 921 |
| 7 | 524 |
| 8 | 892 |
| 9 | 489 |


In [9]:
# Winsorize: clamp the lowest 10% and highest 10% of values to the nearest remaining value
sample_df['v1_winsorized'] = winsorize(sample_df['v1'], limits=[0.1, 0.1])
sample_df.head(10)

|   | v1 | v1_winsorized |
|---|---|---|
| 0 | 53 | 90 |
| 1 | 90 | 90 |
| 2 | 717 | 717 |
| 3 | 328 | 328 |
| 4 | 396 | 396 |
| 5 | 500 | 500 |
| 6 | 921 | 915 |
| 7 | 524 | 524 |
| 8 | 892 | 892 |
| 9 | 489 | 489 |
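
To make the mechanics explicit, here is a NumPy-only sketch of rank-based winsorizing. The helper `winsorize_by_rank` is hypothetical and not part of SciPy. It is meant to illustrate the idea of clamping the `k` smallest and `k` largest values to their nearest kept neighbors, and it assumes simple one-dimensional input without missing values.

```python
import numpy as np

def winsorize_by_rank(values, lower=0.1, upper=0.1):
    """Clamp the lowest and highest fractions of values to the nearest kept value."""
    arr = np.asarray(values, dtype=float).copy()
    n = arr.size
    order = np.argsort(arr)      # indices that sort the array
    k_low = int(lower * n)       # number of values to clamp at the low end
    k_high = int(upper * n)      # number of values to clamp at the high end
    if k_low > 0:
        arr[order[:k_low]] = arr[order[k_low]]
    if k_high > 0:
        arr[order[n - k_high:]] = arr[order[n - k_high - 1]]
    return arr

# Illustrative data with one low and one high outlier.
data = [1, 5, 6, 7, 8, 9, 10, 11, 12, 100]
result = winsorize_by_rank(data)
print(result)
```

With ten values and limits of 0.1 on each side, exactly one value is clamped at each end: 1 becomes 5 and 100 becomes 12, while the row count stays the same. That is the main difference from trimming, which would drop those rows instead.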


### 3. Missing Values: Uncollected Information