# Feature Engineering

Feature engineering is the process of using domain knowledge about the data to create features (variables) that machine learning models can use.

## Feature Engineering Techniques
* Missing Data Imputation
* Categorical Variable Encoding
* Variable Transformation
* Creating New Features
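
The four techniques listed above can be sketched on a small toy table. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are made up (loosely modelled on Titanic-style columns), and the specific choices (median imputation, one-hot encoding, a `log1p` transform, a `family_size` feature) are common options, not necessarily the ones used later in this material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, loosely modelled on Titanic-style columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 8.05, 26.55],
    "sibsp": [1, 1, 0, 0],
    "parch": [0, 0, 0, 2],
})

# 1. Missing data imputation: flag the missing ages, then fill them with the median.
toy["age_missing"] = toy["age"].isnull().astype(int)
toy["age"] = toy["age"].fillna(toy["age"].median())

# 2. Categorical variable encoding: one-hot encode `sex`.
toy = pd.get_dummies(toy, columns=["sex"], dtype=int)

# 3. Variable transformation: log(1 + x) to reduce the skew of `fare`.
toy["fare_log"] = np.log1p(toy["fare"])

# 4. Creating new features: family size = siblings/spouses + parents/children + self.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

print(toy)
```

The missing-value indicator is computed before filling, so the information about which rows were imputed is not lost.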

## Datasets

### Titanic Dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
df.head()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [3]:
df = df.replace('?',np.nan)
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [4]:
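def get_first_cabin(row):
    # Keep only the first cabin when several are listed, e.g. 'C22 C26' -> 'C22'.
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        # Missing values (NaN is a float) and empty strings have no cabin to extract.
        return np.nan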

In [5]:
df['cabin'] = df['cabin'].apply(get_first_cabin)

In [6]:
df['cabin']

0        B5
1       C22
2       C22
3       C22
4       C22
       ... 
1304    NaN
1305    NaN
1306    NaN
1307    NaN
1308    NaN
Name: cabin, Length: 1309, dtype: object

In [7]:
df.to_csv('data/titanic.csv',index=False)

### Credit Approval UCI data

In [8]:
df = pd.read_csv('data/crx.data', header = None)

In [9]:
varnames = ['A'+str(s) for s in range(1,17)]

In [13]:
df.columns = varnames

In [None]:
df = df.replace('?', np.nan)

In [None]:
df['A2'] = df['A2'].astype('float')
df['A14'] = df['A14'].astype('float')
df['A16'] = df['A16'].map({'+':1, '-':0})

df.head()

In [None]:
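import random

random.seed(9001)

# Pick up to 100 distinct row labels. randint includes both endpoints,
# so the upper bound must be len(df) - 1 to stay within the index.
values = sorted({random.randint(0, len(df) - 1) for _ in range(100)})

# Introduce missing values at those rows (recent pandas versions reject set indexers in .loc).
for var in ['A3', 'A8', 'A9', 'A10']:
    df.loc[values, var] = np.nan

df.isnull().sum()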

In [None]:
df.to_csv('data/creditApprovalUCI.csv', index=False)